Statistical Progamming and Modelling with Montserrat Meteorological Data¶
Curiosty Project¶
Researcher and CUNY City Tech Graduated Alumni: Le' Sean Roberts
Contact:
- Email: lesean.roberts85@gmail.com
- Phone: (718) 559-7671 or (868) 383-9658
Date: [24/06/2025]
Abstract¶
This project concerns exploration of the applications of data wrangling, exploratory data analysis (EDA) techniques, statistical analysis, stochastic models and machine learning methods on historical weather data to extract valuable insights and patterns. The advantages of daily and hourly meteorological data will be leveraged to apply the prior mentioned tools and subjects.
Introduction¶
Meteorological data, with its inherent variability and complex patterns, presents an ideal playground for the application of statistical and stochastic applications. These mathematical tools allows one to delve into the intricacies of weather phenomena, uncovering hidden trends, making accurate predictions, and gaining deeper insights into climate systems.
Statistical analysis forms the foundation for comprehending meteorological data. Primitively, descriptive statistics can be employed, being the view of key characteristics such as mean, median, mode, standard deviation (and variance). Such measures provide a quantitative overview of the data's central tendency and dispersion. Also, distribution investigation is possible with tools such as measures of skewness and kurtosis, along with histograms and quantile-quantile plots. For the case of continuous attributes, the use of Pearson correlation can provide insights on possible linearity associations between attributes.
Inferential statistics takes the analysis further by drawing conclusions about a population (or period of weather data) based on a sample. Hypothesis testing allows for assessing the significance of observed differences or relationships. For example, there's ability to test whether the average temperature in a particular region has increased over the past century, or whether there's significant difference between two periods. Prior development in statistical analysis (descriptive statistics, skewness, kurtosis, and others) to be a premature guide on the types of inferential statistics tools to be applied. Data is to be studied, and not to be making loose assumptions upon it.
While statistical methods provide valuable insights, they often fall short in capturing inherent randomness and temporal dependence present in meteorological data. Stochastic processes, on the other hand, are mathematical models that describe the evolution of a system over time., incorporating deterministic and random components.
One of the commonly applied stochastic processes in meteorology is the autoregressive (AR) model. This model assumes that the current value of a variable is a linear function of its past values, along with a random error term. AR models are particularly useful for forecasting time series data. In general, processes that extend the AR model may be included in this project. Statistical and stochastic processes have numerous application in meteorology, including:
Climate Modeling: These techniques are essential for developing complex climate models that simulate the Earth's climate system and predict future climate change.
Climate Change Detection and Attribution: Statistical techniques are applied to detect trends in climate data and attribute these trends to human activities or natural variability.
Extreme Event Analysis: These methods help analyze extreme weather events, such as hurricanes, floods, and heatwaves, and assess their potential impacts.
Weather Forecasting: Statistical and stochastic models are used to improve the accuracy of weather forecasts, especially for short-term predictions.
This project assimilates meteorological data focused on Montserrat, an island in the Caribbean of the Lesser Antilles. Units for such meteorological data is expressed in the metric system; measures of length or depth are in millimeters however, and km/h for the case of windspeed. The time format is of ISO 8601.
Repositories from various government agencies such as the National Oceanic and Atmospheric Administration (NOAA), National Weather Service (NWS), and National Centers for Environmental Information (NCEI), together with revered resources such as Kaggle and Open-Meteo API, proved to be valuable resources for this project.
None of the prior mentioned endeavors are possible without competent data assimilation and wrangling processes and practices. On many occasions data wrangling can be applied towards various interests in modeling and exploratory data analysis. The more knowledge and vigor one has, such results in the need for dedication and unorthodox pursuits, being essential features for revolutionary development w.r.t. to time and resources. Data assimilation and data wrangling can be quite a challenge w.r.t. the types of programming languages applied. This project leverages the Python programming language, requiring much more technical development and patience than a conventional statistical development language such as R. This project is heavier on the programming side than on the mathematical side because there's emphasis on getting substantial development, rather than clique ideologies and luxurious "espieglerie". On many occasions throughout the process, various errors or concerns are encountered due to the natural being of applied data files; cases of mixed data types, bad entries, missing data, instances appearing as a particular data type but being something else; values incompatible with particular models for convergence, conventions triggering warnings, deprecated customs, and so forth.
Furthermore, this project also leverages various types of machine learning methods, such as supervised learning (common regression, and classification), unsupervised learning (histogram-based outlier score and local outlier factor), and ensemble learning (random forests and XGBoost). Such a course can be a respectable substitute for numerical weather prediction (NWP) models when computational complexity is not desired, and time is limited for research development. However, large data sets are generally required, which may at times sully the prestige of machine learning models compared to NWP models. Poor models are natural; decent or good/great models require much ingenuity sans creating underlying bias or fraudulence.
Methodology¶
A conventional idea about statistical programming or data science, the following steps are statutory:
Data Collection and Cleaning
Exploratory Data Analysis
Feature Engineering
Model Development
Validation and Testing
Dashboard Development
However, in real practice there may be a back-and-forth process among various steps depending on what pursuits there are. Classification, building classes or feature engineering can arise on multiple occasions to develop robust models. This project often implements such unconventional activity.
Declaration¶
This project is designed with the intent to provide readers with the tools and structure necessary to critically investigate and assess its methodology. It does not aim to create superficial or overly simplified outcomes, but rather fosters a genuine opportunity for analysis and constructive critique. By utilizing the Python programming language, the project enables progressive and continuous development, empowering a global community of researchers, regardless of socioeconomic background or time constraints.
While the project is an elementary step, it represents a foundational development crucial for advancing serious data analysis work. It is important to recognize that data analysis cannot be confined to classroom norms and customs, as not everything can be fully taught in an academic setting. This project assumes that readers already possess a strong foundational knowledge of basic data analysis, as it is not intended as a teaching tool. Instead, it is developed with the expectation that serious, real-world applications in commerce and development are the ultimate goals. My intention is not to teach, as I do not have the time or resources to engage in teaching roles. This project is a step toward advancing the field and should be viewed as such.
Daily Meteorological Data¶
Daily meteorological/climate data, a collection of measurements taken over a 24-hour period, provides a vital foundation for understanding climate trends and predicting future atmospheric conditions. This data encompasses variables such as temperature, precipitation, wind speed and direction, atmospheric pressure, etc.
The collection of daily meteorological data relies on a network of weather stations equipped with specialized instruments. Thermometers measure temperature, rain gauges collect precipitation, anemometers gauge wind speed and direction, barometers measure atmospheric pressure, and so forth. This data is then processed and analyzed by meteorologists to extract meaningful insights.
The applications of daily meteorological data are diverse and far-reaching. In the realm of weather forecasting, this data serves as the cornerstone for predicting future weather patterns, enabling individuals and organizations to plan activities and make informed decisions. Climate studies rely on long-term analysis of daily meteorological data to identify trends, understand climate variability, and assess the impacts of climate change. In agriculture, daily meteorological data plays a crucial role in optimizing planting and harvesting schedules, irrigation practices, and pest control strategies. Additionally, the energy sector relies on daily meteorological data to forecast energy demand, ensuring efficient grid management and resource allocation. Meteorological data, collected at various temporal resolutions, provides invaluable insights into weather patterns. While hourly data offers detailed information about short-term weather events, daily data is often more suitable for analyzing long-term climate trends. This distinction arises from several key factors.
Firstly, the sheer volume of hourly data can be overwhelming, particularly when dealing with extensive datasets spanning multiple years. With hourly data the frequency of observations can obscure underlying patterns and make it difficult to identify significant trends. In contrast, daily data, aggregated from hourly observations, reduces the noise, and allows for a more focused analysis of long-term climate signals.
Secondly, daily data often incorporates averaging or smoothing techniques, which can help to mitigate the impact of short-term weather fluctuations. These techniques reduce the variability of the data, making it easier to discern underlying trends and patterns. Hourly data, on the other hand, may be more susceptible to the influence of transient weather events, such as thunderstorms or brief temperature spikes, which can obscure long-term climate signals.
Thirdly, climate signals, including seasonal variations, long-term warming or cooling trends, and decadal oscillations, are typically more pronounced at the daily timescale. Hourly data can be more sensitive to short-term weather phenomena, which may mask these larger-scale patterns. By focusing on daily averages, researchers can better isolate and analyze the long-term climate signals embedded within the data. Moreover, many statistical methods used in climate analysis, such as correlation analysis and regression modeling, are better suited to daily data due to its reduced variability and the potential for more robust statistical relationships. These methods can help to identify meaningful connections between climate variables and underlying drivers, providing valuable insights into climate dynamics.
Finally, daily data is generally less computationally intensive to store, and process compared to hourly data, which can be particularly important for large datasets spanning multiple decades. This efficiency allows researchers to work with larger datasets and conduct more complex analyses. While hourly meteorological data is essential for understanding short-term weather events, daily data offers a more effective lens for examining long-term climate trends and variability. By reducing noise, focusing on broader patterns, and facilitating statistical analysis, daily data provides a valuable resource for climate scientists and researchers seeking to understand the Earth’s climate system. Daily meteorological or climate data is an invaluable resource that underpins our understanding of the Earth’s atmosphere and its complex systems. By collecting, processing, and analyzing this data, we gain valuable insights into weather patterns, climate trends, and the impacts of environmental factors. This information is essential for informed decision-making, sustainable development, and the well-being of our planet.
The daily data applied stems from the Open-Meteo API. The data ranges from year 1990 to year 2025.
Data Wrangling: Data Assimilation, Data Frames and Cleaning¶
Assimilating data from sources such as repositories, databases and APIs is common practice today, requiring basic scripts to retrieve data with respect to unique parameters.
Data cleaning generally concerns identifying and correcting errors, inconsistencies, or missing values. This may involve tasks such as removing duplicates, imputing missing data, etc.
import openmeteo_requests
import pandas as pd
import requests_cache
from retry_requests import retry
# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)
# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
"latitude": 16.7425,
"longitude": -62.1874,
"start_date": "1980-01-08",
"end_date": "2025-06-24",
"daily": ["temperature_2m_mean", "temperature_2m_max", "temperature_2m_min", "apparent_temperature_mean", "apparent_temperature_max", "apparent_temperature_min", "wind_speed_10m_max", "et0_fao_evapotranspiration", "rain_sum", "dew_point_2m_max", "dew_point_2m_min", "surface_pressure_max", "surface_pressure_min", "pressure_msl_max", "pressure_msl_min", "relative_humidity_2m_max", "relative_humidity_2m_min", "wet_bulb_temperature_2m_max", "wet_bulb_temperature_2m_min", "vapour_pressure_deficit_max", "soil_temperature_0_to_7cm_mean"],
"timezone": "auto"
}
responses = openmeteo.weather_api(url, params=params)
# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()}{response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")
# Process daily data. The order of variables needs to be the same as requested.
daily = response.Daily()
daily_temperature_2m_mean = daily.Variables(0).ValuesAsNumpy()
daily_temperature_2m_max = daily.Variables(1).ValuesAsNumpy()
daily_temperature_2m_min = daily.Variables(2).ValuesAsNumpy()
daily_apparent_temperature_mean = daily.Variables(3).ValuesAsNumpy()
daily_apparent_temperature_max = daily.Variables(4).ValuesAsNumpy()
daily_apparent_temperature_min = daily.Variables(5).ValuesAsNumpy()
daily_wind_speed_10m_max = daily.Variables(6).ValuesAsNumpy()
daily_et0_fao_evapotranspiration = daily.Variables(7).ValuesAsNumpy()
daily_rain_sum = daily.Variables(8).ValuesAsNumpy()
daily_dew_point_2m_max = daily.Variables(9).ValuesAsNumpy()
daily_dew_point_2m_min = daily.Variables(10).ValuesAsNumpy()
daily_surface_pressure_max = daily.Variables(11).ValuesAsNumpy()
daily_surface_pressure_min = daily.Variables(12).ValuesAsNumpy()
daily_pressure_msl_max = daily.Variables(13).ValuesAsNumpy()
daily_pressure_msl_min = daily.Variables(14).ValuesAsNumpy()
daily_relative_humidity_2m_max = daily.Variables(15).ValuesAsNumpy()
daily_relative_humidity_2m_min = daily.Variables(16).ValuesAsNumpy()
daily_wet_bulb_temperature_2m_max = daily.Variables(17).ValuesAsNumpy()
daily_wet_bulb_temperature_2m_min = daily.Variables(18).ValuesAsNumpy()
daily_vapour_pressure_deficit_max = daily.Variables(19).ValuesAsNumpy()
daily_soil_temperature_0_to_7cm_mean = daily.Variables(20).ValuesAsNumpy()
daily_data = {"date": pd.date_range(
start = pd.to_datetime(daily.Time(), unit = "s", utc = True),
end = pd.to_datetime(daily.TimeEnd(), unit = "s", utc = True),
freq = pd.Timedelta(seconds = daily.Interval()),
inclusive = "left"
)}
daily_data["temperature_2m_mean"] = daily_temperature_2m_mean
daily_data["temperature_2m_max"] = daily_temperature_2m_max
daily_data["temperature_2m_min"] = daily_temperature_2m_min
daily_data["apparent_temperature_mean"] = daily_apparent_temperature_mean
daily_data["apparent_temperature_max"] = daily_apparent_temperature_max
daily_data["apparent_temperature_min"] = daily_apparent_temperature_min
daily_data["wind_speed_10m_max"] = daily_wind_speed_10m_max
daily_data["et0_fao_evapotranspiration"] = daily_et0_fao_evapotranspiration
daily_data["rain_sum"] = daily_rain_sum
daily_data["dew_point_2m_max"] = daily_dew_point_2m_max
daily_data["dew_point_2m_min"] = daily_dew_point_2m_min
daily_data["surface_pressure_max"] = daily_surface_pressure_max
daily_data["surface_pressure_min"] = daily_surface_pressure_min
daily_data["pressure_msl_max"] = daily_pressure_msl_max
daily_data["pressure_msl_min"] = daily_pressure_msl_min
daily_data["relative_humidity_2m_max"] = daily_relative_humidity_2m_max
daily_data["relative_humidity_2m_min"] = daily_relative_humidity_2m_min
daily_data["wet_bulb_temperature_2m_max"] = daily_wet_bulb_temperature_2m_max
daily_data["wet_bulb_temperature_2m_min"] = daily_wet_bulb_temperature_2m_min
daily_data["vapour_pressure_deficit_max"] = daily_vapour_pressure_deficit_max
daily_data["soil_temperature_0_to_7cm_mean"] = daily_soil_temperature_0_to_7cm_mean
daily_dataframe = pd.DataFrame(data = daily_data)
print(daily_dataframe)
Coordinates 16.76625633239746°N -62.20843505859375°E
Elevation 309.0 m asl
Timezone b'America/Montserrat'b'GMT-4'
Timezone difference to GMT+0 -14400 s
date temperature_2m_mean temperature_2m_max \
0 1980-01-08 04:00:00+00:00 23.374834 24.141499
1 1980-01-09 04:00:00+00:00 23.264421 23.891499
2 1980-01-10 04:00:00+00:00 22.322748 23.191502
3 1980-01-11 04:00:00+00:00 22.587332 23.341499
4 1980-01-12 04:00:00+00:00 21.306086 22.091499
... ... ... ...
16600 2025-06-20 04:00:00+00:00 25.351082 26.199001
16601 2025-06-21 04:00:00+00:00 25.390665 25.898998
16602 2025-06-22 04:00:00+00:00 25.317749 25.898998
16603 2025-06-23 04:00:00+00:00 NaN 25.848999
16604 2025-06-24 04:00:00+00:00 NaN NaN
temperature_2m_min apparent_temperature_mean \
0 22.191502 22.092840
1 22.191502 22.358231
2 21.341499 21.067259
3 21.841499 19.905577
4 20.541500 19.145449
... ... ...
16600 24.848999 25.104864
16601 24.699001 25.419016
16602 24.449001 24.848602
16603 25.098999 NaN
16604 NaN NaN
apparent_temperature_max apparent_temperature_min wind_speed_10m_max \
0 23.520189 20.983297 37.212578
1 23.697132 21.602598 36.896046
2 22.371422 19.988932 35.654541
3 20.436180 18.984425 42.072281
4 19.637054 18.262983 40.104061
... ... ... ...
16600 27.231419 23.766788 40.882591
16601 27.573139 24.278919 38.166790
16602 26.219694 23.004978 44.039349
16603 25.357843 23.626095 42.990990
16604 NaN NaN NaN
et0_fao_evapotranspiration rain_sum ... surface_pressure_max \
0 3.982460 1.5 ... 983.794922
1 3.946293 0.8 ... 984.397400
2 3.259691 2.7 ... 983.913513
3 4.604709 0.5 ... 983.572449
4 2.766571 5.7 ... 982.082092
... ... ... ... ...
16600 4.981394 0.1 ... 983.506775
16601 5.119689 0.0 ... 983.344971
16602 5.130907 1.0 ... 982.319397
16603 NaN NaN ... 981.898865
16604 NaN NaN ... NaN
surface_pressure_min pressure_msl_max pressure_msl_min \
0 980.577454 1019.299988 1016.099976
1 981.443359 1019.900024 1016.900024
2 980.805786 1019.599976 1016.299988
3 980.355164 1019.099976 1015.900024
4 978.976501 1017.799988 1014.599976
... ... ... ...
16600 981.255981 1018.700012 1016.500000
16601 980.240479 1018.700012 1015.400024
16602 979.411743 1017.500000 1014.500000
16603 979.643860 1017.099976 1014.799988
16604 NaN NaN NaN
relative_humidity_2m_max relative_humidity_2m_min \
0 87.652779 70.725937
1 87.906815 73.156029
2 90.619431 71.578697
3 81.800613 61.149487
4 89.427284 78.321884
... ... ...
16600 86.541199 70.866669
16601 85.219734 72.591751
16602 86.767601 72.591751
16603 84.229759 75.320984
16604 NaN NaN
wet_bulb_temperature_2m_max wet_bulb_temperature_2m_min \
0 21.027277 20.169138
1 20.914402 20.337797
2 20.636232 18.998484
3 19.724335 17.843048
4 19.959215 19.202456
... ... ...
16600 23.118631 21.683819
16601 22.751518 22.099451
16602 22.906918 21.904879
16603 23.149427 22.411777
16604 NaN NaN
vapour_pressure_deficit_max soil_temperature_0_to_7cm_mean
0 0.880710 24.816500
1 0.795568 24.729010
2 0.783625 24.678999
3 1.107534 24.629000
4 0.576288 24.578997
... ... ...
16600 0.984500 26.217749
16601 0.912614 26.238586
16602 0.912614 26.267754
16603 0.821694 NaN
16604 NaN NaN
[16605 rows x 22 columns]
Daily Meteorological Attributes¶
temperature_2m_mean °C: Mean daily air temperature at 2 meters above ground.
temperature_2m_max and temperature_2m_min °C: Maximum and minimum daily air temperature at 2 meters above ground.
apparent_temperature_max and apparent_temperature_min °C: Mean, Maximum and minimum daily apparent temperature.
rain_sum (mm): Sum of daily rain
wind_speed_10m_max and wind_gusts_10m_max (km/h (mph, m/s, knots)): Maximum wind speed and gusts on a day.
et0_fao_evapotranspiration (mm): Daily sum of ETO Reference Evapotranspiration of a well-watered grass field per day in Megajoules.
surface_pressure_max and surface_pressure_min (hPa): Surface pressure
pressure_msl_max and pressure_msl_min (hPa): Maximum and minimum atmospheric air pressure reduced to mean sea level (msl) or pressure at surface. Typically pressure on mean sea level is used in meteorology.
relative_humidity_2m_max and relative_humidity_2m_min (%): Maximum and minimum relative humidity at 2 meters above ground.
wet_bulb_temperature_2m_max and wet_bulb_temperature_2m_min (°C): maximum and minimum lowest temperature that can be reached by evaporating water into the air at a constant pressure.
vapour_pressure_deficit (kPa): Maximum Vapor Pressure Deificit (VPD) in kilopascal (kPa). For high VPD (>1.6), water transpiration of plants increases. For low VPD (<0.4), transpiration decreases.
soil_temperature_0_to_7cm_mean (°C): Mean average temperature of different soil levels below ground.
dew_point_2m_max and dew_point_2m_min (°C): Maximum and minimum dew point temperature at 2 meters above ground.
# Dropping missing values
# Observing attribute data properties
daily_dataframe_clean = daily_dataframe.dropna()
daily_dataframe_clean.info()
<class 'pandas.core.frame.DataFrame'> Index: 16603 entries, 0 to 16602 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 16603 non-null datetime64[ns, UTC] 1 temperature_2m_mean 16603 non-null float32 2 temperature_2m_max 16603 non-null float32 3 temperature_2m_min 16603 non-null float32 4 apparent_temperature_mean 16603 non-null float32 5 apparent_temperature_max 16603 non-null float32 6 apparent_temperature_min 16603 non-null float32 7 wind_speed_10m_max 16603 non-null float32 8 et0_fao_evapotranspiration 16603 non-null float32 9 rain_sum 16603 non-null float32 10 dew_point_2m_max 16603 non-null float32 11 dew_point_2m_min 16603 non-null float32 12 surface_pressure_max 16603 non-null float32 13 surface_pressure_min 16603 non-null float32 14 pressure_msl_max 16603 non-null float32 15 pressure_msl_min 16603 non-null float32 16 relative_humidity_2m_max 16603 non-null float32 17 relative_humidity_2m_min 16603 non-null float32 18 wet_bulb_temperature_2m_max 16603 non-null float32 19 wet_bulb_temperature_2m_min 16603 non-null float32 20 vapour_pressure_deficit_max 16603 non-null float32 21 soil_temperature_0_to_7cm_mean 16603 non-null float32 dtypes: datetime64[ns, UTC](1), float32(21) memory usage: 1.6 MB
daily_dataframe_clean.isna().sum()
date 0 temperature_2m_mean 0 temperature_2m_max 0 temperature_2m_min 0 apparent_temperature_mean 0 apparent_temperature_max 0 apparent_temperature_min 0 wind_speed_10m_max 0 et0_fao_evapotranspiration 0 rain_sum 0 dew_point_2m_max 0 dew_point_2m_min 0 surface_pressure_max 0 surface_pressure_min 0 pressure_msl_max 0 pressure_msl_min 0 relative_humidity_2m_max 0 relative_humidity_2m_min 0 wet_bulb_temperature_2m_max 0 wet_bulb_temperature_2m_min 0 vapour_pressure_deficit_max 0 soil_temperature_0_to_7cm_mean 0 dtype: int64
Summary Statistics¶
Summary statistics (also called descriptive statistics) are a set of numbers that describe the central tendency, spread, and shape of your data. They serve to comprehend the key features of your data quickly.
Measures of Central Tendency: These tell you where the center of your data lies
Mean: The average of all values.
Median: The middle value when data is sorted.
Mode: The most frequently occurring value. Highly meaningful for categorical or ordinal data, but generally not the case for continuous attributes since the spread of such data is generally large without case groupings.
Measures of Spread: These tell you where the center of your data lies
Range: The difference between the highest and lowest values.
Interquartile Range (IQR): The range of the middle 50% of the data.
Variance: The average squared difference from the mean.
Standard Deviation: The square root of the variance, giving a measure of spread in the same units as the data.
Measures of Shape: These tell you where the center of your data lies
Skewness: Measures how symmetric your data is.
Positive skew: tail on the right
Negative skew: tail on the left
The baseline distribution in many (but not all) cases is normal distribution. A skewness value of 0 conveys symmetry. Realistic data doesn’t have such exact value, but may come close to it if a high level of symmetry exists.
Kurtosis: Measures how peaked or flat your data is.
High kurtosis: very peaked
Low kurtosis: very flat
The baseline distribution in many (but not all) cases is normal distribution. A kurtosis value of 3 conveys the data to be normal. Realistic data doesn’t have such exact value, but may come close to it if a high level of normality exists.
Summary statistics are essential tools in the data analyst's toolkit, providing a concise overview of the key characteristics of a dataset. By calculating numerical measures that describe the central tendency, dispersion, and shape of the data, summary statistics help analysts quickly understand the data's underlying patterns and make informed decisions.
Measures of central tendency, such as the mean, median, and mode, provide information about the typical or representative value of the data. The mean represents the average value, the median indicates the middle value when the data is sorted, and the mode identifies the most frequently occurring value. These statistics help analysts understand the central location of the data and identify any potential biases or skewness.
Measures of dispersion, such as the range, variance, and standard deviation, quantify the spread or variability of the data. The range indicates the overall spread of the data, while the variance and standard deviation measure the average squared deviation from the mean. These statistics help analysts understand how much the data points vary from the central tendency and identify outliers or unusual values.
Measures of shape, such as skewness and kurtosis, provide insights into the overall distribution of the data. Skewness measures the asymmetry of the distribution, indicating whether the tail on one side is longer than the other. Kurtosis measures the peakedness or flatness of the distribution, revealing whether the data has heavy tails or a sharp peak.
Summary statistics are invaluable for understanding the basic properties of a dataset and for making informed decisions. They can be used to identify outliers, compare different groups of data, and assess the overall distribution of the data. By effectively using summary statistics, analysts can gain valuable insights into their data and make data-driven decisions with confidence.
# Drop the first column, since 'date' or datetime format isn't meaningful for summary statistics
daily_data_sans_first_col = daily_dataframe_clean.iloc[:, 1:]
daily_summary_stats = daily_data_sans_first_col.describe()
print(daily_summary_stats)
temperature_2m_mean temperature_2m_max temperature_2m_min \
count 16603.000000 16603.000000 16603.000000
mean 24.273718 25.039673 23.379768
std 1.146305 1.359918 1.125049
min 20.558165 21.441502 18.841499
25% 23.376921 23.991501 22.541500
50% 24.374832 25.141499 23.441502
75% 25.106709 25.841499 24.191502
max 27.792749 29.699001 27.199001
apparent_temperature_mean apparent_temperature_max \
count 16603.000000 16603.000000
mean 24.702103 26.499727
std 2.199432 2.520320
min 17.356228 18.977034
25% 23.038151 24.679927
50% 24.931297 26.687723
75% 26.306145 28.271816
max 32.091747 34.299778
apparent_temperature_min wind_speed_10m_max \
count 16603.000000 16603.000000
mean 23.352652 30.715977
std 2.104526 6.617004
min 15.962875 6.792466
25% 21.796345 26.649727
50% 23.566624 31.035257
75% 24.888628 35.068369
max 31.027546 93.806084
et0_fao_evapotranspiration rain_sum dew_point_2m_max ... \
count 16603.000000 16603.000000 16603.000000 ...
mean 4.468894 2.086117 20.885281 ...
std 0.739465 4.837135 1.480140 ...
min 1.299162 0.000000 13.591500 ...
25% 3.995539 0.100000 19.799000 ...
50% 4.517640 0.800000 21.191502 ...
75% 4.963439 2.100000 22.091499 ...
max 7.167190 151.499985 24.199001 ...
surface_pressure_max surface_pressure_min pressure_msl_max \
count 16603.000000 16603.000000 16603.000000
mean 981.201477 978.281189 1016.515198
std 1.829245 1.903635 1.925053
min 969.477356 956.304626 1004.400024
25% 980.028717 977.115845 1015.299988
50% 981.344055 978.492126 1016.700012
75% 982.463745 979.609558 1017.799988
max 986.992188 983.832581 1022.500000
pressure_msl_min relative_humidity_2m_max relative_humidity_2m_min \
count 16603.000000 16603.000000 16603.000000
mean 1013.511536 84.635811 72.861374
std 1.997123 4.958839 6.816984
min 990.700012 56.597607 35.666039
25% 1012.299988 81.713306 69.922222
50% 1013.700012 85.474670 74.697205
75% 1014.900024 88.505695 77.612133
max 1019.200012 96.143120 87.194046
wet_bulb_temperature_2m_max wet_bulb_temperature_2m_min \
count 16603.000000 16603.000000
mean 21.826445 20.951443
std 1.289275 1.404822
min 16.469954 15.306818
25% 20.785082 19.948974
50% 22.077534 21.231716
75% 22.850239 22.076393
max 24.967825 24.356148
vapour_pressure_deficit_max soil_temperature_0_to_7cm_mean
count 16603.000000 16603.000000
mean 0.863740 26.025419
std 0.251260 1.748632
min 0.366002 22.646914
25% 0.700161 24.816500
50% 0.791879 25.829008
75% 0.945766 26.641506
max 2.260390 35.667747
[8 rows x 21 columns]
Skew and Kurtosis¶
Skew and kurtosis are two statistical measures that provide valuable insights into the shape and characteristics of a probability distribution. While the mean and standard deviation offer central tendency and dispersion, skew and kurtosis delve into the asymmetry and peakedness of a dataset, respectively.
Skew measures the asymmetry of a distribution. A positive skew indicates that the tail to the right (larger values) is longer or heavier than the tail to the left. Conversely, a negative skew suggests that the tail to the left (smaller values) is longer. A zero skew implies a symmetric distribution. Skew is often visualized as a distortion of the normal distribution curve, with the peak shifted to one side and the tail extended in the opposite direction.
Kurtosis measures the peakedness or flatness of a distribution relative to a normal distribution. A high kurtosis, also known as leptokurtosis, indicates a distribution with heavy tails and a sharp peak. This means that there is a higher probability of extreme values occurring. In contrast, a low kurtosis, or platykurtosis, suggests a distribution with light tails and a flat peak, implying a lower likelihood of extreme events. A mesokurtic distribution has a kurtosis similar to a normal distribution.
Understanding skew and kurtosis is essential for data analysis and interpretation. For instance, a positively skewed distribution might suggest that there are a few very large values that are pulling the mean to the right, while a negatively skewed distribution could indicate the presence of a few very small values. Kurtosis can help identify outliers or unusual patterns in a dataset.
import scipy.stats as stats
#Skew and kurtosis
skewness = daily_data_sans_first_col.skew()
kurtosis = daily_data_sans_first_col.kurtosis()
print("Skewness:")
print(skewness)
print("\nKurtosis:")
print(kurtosis)
Skewness: temperature_2m_mean -0.080265 temperature_2m_max 0.287729 temperature_2m_min -0.120366 apparent_temperature_mean -0.181386 apparent_temperature_max -0.097145 apparent_temperature_min -0.182777 wind_speed_10m_max -0.031380 et0_fao_evapotranspiration -0.260995 rain_sum 9.741498 dew_point_2m_max -0.675294 dew_point_2m_min -0.967386 surface_pressure_max -0.437890 surface_pressure_min -0.847299 pressure_msl_max -0.411078 pressure_msl_min -0.827830 relative_humidity_2m_max -0.977159 relative_humidity_2m_min -1.182832 wet_bulb_temperature_2m_max -0.426486 wet_bulb_temperature_2m_min -0.623472 vapour_pressure_deficit_max 1.489722 soil_temperature_0_to_7cm_mean 1.471327 dtype: float32 Kurtosis: temperature_2m_mean -0.575250 temperature_2m_max -0.045151 temperature_2m_min -0.356732 apparent_temperature_mean -0.512940 apparent_temperature_max -0.415506 apparent_temperature_min -0.437406 wind_speed_10m_max 1.502347 et0_fao_evapotranspiration 0.357547 rain_sum 166.088089 dew_point_2m_max 0.136888 dew_point_2m_min 0.824719 surface_pressure_max 0.598648 surface_pressure_min 3.302744 pressure_msl_max 0.517645 pressure_msl_min 3.165608 relative_humidity_2m_max 1.211777 relative_humidity_2m_min 1.198043 wet_bulb_temperature_2m_max -0.528162 wet_bulb_temperature_2m_min -0.190336 vapour_pressure_deficit_max 2.257868 soil_temperature_0_to_7cm_mean 3.004556 dtype: float32
Histograms and Quantile-Quantile Plots¶
Now, to provide a visual display of the distributions to acquire and apply visual judgement. Histograms provide pictures or the shapes of the distributions, while Q-Q plots provide view of the divergence or disaparity from (in our case) normal distribution.
NOTE: the benchmark or ideal distribution doesn't have to be normal.
import matplotlib.pyplot as plt
import seaborn as sns
# Get the column names
column_names = daily_data_sans_first_col.columns
print(column_names)
column_names_list = column_names.tolist()
# Calculating the number of ros and columns for subplots.
num_cols = 3 # 3 columns
num_rows = (len(column_names_list) + num_cols - 1) // num_cols
# Calculating the number of rows
# Creating subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize = (15, 10))
# Flatten if required.
if num_rows > 1:
axes = axes.flatten()
# Plot the histograms
for i, col in enumerate(column_names_list):
sns.histplot(data = daily_data_sans_first_col[col], ax = axes[i], kde = True)
axes[i].set_title(f'Histogram of {col}')
axes[i].set_xlabel('Value')
axes[i].set_ylabel('Frequency')
axes[i].grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
Index(['temperature_2m_mean', 'temperature_2m_max', 'temperature_2m_min',
'apparent_temperature_mean', 'apparent_temperature_max',
'apparent_temperature_min', 'wind_speed_10m_max',
'et0_fao_evapotranspiration', 'rain_sum', 'dew_point_2m_max',
'dew_point_2m_min', 'surface_pressure_max', 'surface_pressure_min',
'pressure_msl_max', 'pressure_msl_min', 'relative_humidity_2m_max',
'relative_humidity_2m_min', 'wet_bulb_temperature_2m_max',
'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max',
'soil_temperature_0_to_7cm_mean'],
dtype='object')
# Get the column names
column_names_no_date = daily_data_sans_first_col.columns
print(column_names_no_date)
column_names_list_no_date = column_names_no_date.tolist()
# Creating subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize = (15, 10))
# Flatten if required.
if num_rows > 1:
axes = axes.flatten()
# Plot the histograms
for i, col in enumerate(column_names_list_no_date):
ax = axes[i]
stats.probplot(daily_data_sans_first_col[col], dist = "norm", plot = ax)
ax.set_title(f'QQ Plot of {col}')
ax.grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
Index(['temperature_2m_mean', 'temperature_2m_max', 'temperature_2m_min',
'apparent_temperature_mean', 'apparent_temperature_max',
'apparent_temperature_min', 'wind_speed_10m_max',
'et0_fao_evapotranspiration', 'rain_sum', 'dew_point_2m_max',
'dew_point_2m_min', 'surface_pressure_max', 'surface_pressure_min',
'pressure_msl_max', 'pressure_msl_min', 'relative_humidity_2m_max',
'relative_humidity_2m_min', 'wet_bulb_temperature_2m_max',
'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max',
'soil_temperature_0_to_7cm_mean'],
dtype='object')
Scatterplots Encompassing all Physical Variables¶
Observation of scatter plots is a customary preliminary method of model determination involving the observed variables.
# Creating PairGrid with three columns
g = sns.PairGrid(daily_data_sans_first_col)
# Mapping scatterplots
g.map(sns.scatterplot)
# Adjusting columns specification
plt.subplots_adjust(left = 0.1, right = 0.9, top = 0.9, bottom = 0.1, wspace = 0.3, hspace = 0.3)
# Showing plot
plt.show()
Such prior scatter plots provide detail on whether linearity exists among variable pairs. Scatter plotting is a preliminary investigation to determine whether predictive models like (multi)linear regression will be practical. Some scatter plots fall into a linear-type orientation due to observed general trend in data orientation. Else, there are scatter plots whose data are highly condensed or clustered with general shapes.
NOTE: the "perfectly" linear scatter plots in the main diagonal are to be ignored because such cases are variables plotted against themselves, which isn't meaningful.
Correlation and Correlation Heatmaps¶
Correlation measures the strength and direction of the linear relationship between two variables. In other words, it quantifies how well a change in one variable can be associated with a corresponding change in another, based on a straight-line relationship.
Correlation refers to the statistical relationship or association between two variables. When two variables are correlated, changes in one variable tend to be accompanied by changes in the other variable.
Correlation is typically measured using a correlation coefficient, which quantifies the strength and direction of the relationship between the variables. The Pearson correlation coefficient ranges from -1 to 1:
A correlation coefficient of 1 indicates a perfect positive correlation, meaning that as one variable increases, the other variable also increases proportionally.
A correlation coefficient of -1 indicates a perfect negative correlation, meaning that as one variable increases, the other variable decreases proportionally.
A correlation coefficient of 0 indicates no correlation, meaning that there is no systematic relationship between the variables.
As well, for the Pearson measure a high correlation (regardless of sign) value conveys a possible linear relationship between the variables being compared.
Correlation does not imply causation, meaning that even if two variables are correlated, it does not necessarily mean that changes in one variable cause changes in the other variable. Correlation simply quantifies the degree to which two variables vary together. A crude but effective example, "the number of firefighters in operations service corresponds to the number of hazardous fires occuring....however, more firefighters don't cause more fires."
The Pearson correlation coefficient ($r$) is the most common measure of linearity. It ranges from -1 to +1:
$r = +1$: Perfect positive linear relationship. All points lie on an upward-sloping straight line.
$r = -1$: Perfect negative linear relationship. All points lie on a downward-sloping straight line.
$r = 0$: No linear relationship; the data points do not form a recognizable line.
# Applying Pearson correlation to the data set
daily_pearson_corr = daily_dataframe_clean.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(daily_pearson_corr, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation heatmap of Montserrat Daily Meteorological Data')
plt.savefig('daily_heatmap.pdf', format = 'pdf')
plt.show()
The correlation heat map appears to be consistent with the scatter plot matrix from earlier. Pearson correlation conveys levels of association among attributes and/or level of possible linearity. For highly correlated data pairs (> 0.8) the scatter plots will conform to a line with low instantaneous rate of change (with positive or negative slope); lower correlation, say, 0.8 to 0.4 in magnitude would appear elliptical; a correlation of 0 would appear circular, or have some irregular shape, or have clusters, or at least two axes that can well span the data dispersion (where one direction isn't dominant over the other). NOTE: one should not assume that natural real attributes must have linear relationships.
Time Series¶
The prior correlation heatmap conveys very low association between time/date and the physical variables. If the time or date data is not highly correlated with the meteorological data, such indicates that the temporal aspect of the data does not show a strong linear relationship with those meteorological variables. Yet, there should not be the assumption that general data pairs are naturally linear.
A time series is a set of data points that occur in successive order over a period of time. The data applied is reflective of such.
Time series analysis concerns observing possible residing trend, seasonality or cycle properties. The time series data is decomposed to uniquely identify the possible existence of such characteristics.
A time series is a sequence of data points collected over time. Mathematically, it can be represented as:
$$Y(t)=T(t)+S(t)+\epsilon(t)$$$Y(t)$: The observed value of the time series at time t.
$T(t)$: The trend component, representing the long-term direction of the series.
$S(t)$: The seasonal component, representing periodic fluctuations within a fixed time period.
$\epsilon(t)$: The residual or noise component, representing the random fluctuations that cannot be explained by the trend or seasonal components.
Trend Component The trend component can be modeled using various functions, such as:
Linear: $T(t) = \alpha\,t + \beta$
Polynomial: $T(t) = \alpha_0 + \alpha_1(t) + \alpha_2(t^2) + ... + \alpha_n(t^n)$
Exponential: $T(t) = \alpha\,\text {e}^{\beta(t)}$
Logistic: $T(t) = \frac{\alpha}{1+ \beta\,\text {e}^{(-\gamma(t))}}$
Seasonal Component The seasonal component can be modeled using periodic functions, such as:
Sine/Cosine: $S(t) = \alpha\,\text {sin}(\omega\,t + \phi) + \beta\,\text {cos}(\omega,t + \phi)$
Fourier series: $S(t) = a_0 + Σ[a\,n\,cos(2πnt/T) + b\,n\,sin(2πnt/T)]$
Residual Component The residual component is often assumed to be white noise, meaning it has:
Zero mean: $E[\epsilon(t)] = 0$
Constant variance: $Var[\epsilon(t)] = \sigma^2$
No autocorrelation: $\text {Cov}[\epsilon(t), epsilon(s)] = 0\,\,\,\, \text{for t} \neq s$
Stationarity A time series is said to be stationary if its statistical properties (mean, variance, autocorrelation) remain constant over time. Stationarity is a common assumption in many time series models.
Time Series with LOWESS Smoothing¶
LOWESS (Locally Weighted Scatterplot Smoothing) is a non-parametric regression technique used to smooth data in time series or scatterplots. It is particularly useful for capturing trends in data without assuming a specific functional form, making it ideal for exploratory data analysis.
Local Regression: LOWESS performs a series of localized linear regressions across the data. For each point in the dataset, it fits a weighted linear regression using a subset of nearby data points.
Weighted Fitting: Points closer to the target point (in terms of x-values) are given more weight in the local regression. The weight decreases as the distance between the target point and neighboring points increases, often using a tricube weighting function.
Smoothing Parameter (frac): This controls the span or bandwidth of the smoothing window --
A small frac (close to 0) uses fewer neighboring points for each local fit, resulting in a curve that closely follows the data (less smoothing).
A large frac (closer to 1) uses more points for each local fit, producing a smoother curve that captures broader trends but may miss finer details.
Flexible Smoothing: Unlike parametric models that assume a specific relationship (e.g., linear, quadratic), LOWESS adapts to the data. It is especially useful when the true relationship between variables is unknown or non-linear.
Handles Non-Linear Trends: LOWESS can reveal complex patterns, such as oscillations or sudden shifts in time series data, that linear models cannot easily capture.
Local Behavior: Since LOWESS is local to each point, it can adapt to different patterns in different parts of the dataset, making it more flexible than global smoothing methods like polynomial fitting.
No Assumptions About Distribution: As a non-parametric method, LOWESS doesn’t require assumptions about the underlying distribution of the data (e.g., normality), making it a robust choice for noisy or irregular data.
When plotting time series data, raw data may contain a lot of noise, making it difficult to identify general trends. LOWESS helps to:
Smooth Out Short-Term Fluctuations: It filters out high-frequency noise, leaving a clearer picture of long-term trends.
Identify Underlying Patterns: It can reveal the shape and nature of the trend, even in the presence of noisy or irregular data.
A Local Polynomial Regression¶
For each point $t$ in the time series, LOWESS performs a local regression using a subset of the data. Such local polynomial regression can be expressed mathematically as:
$$\hat{Y}(t) = \sum_{j=1}^{n} W_{j}(t) \cdot Y(t_{j}) \cdot \frac{(t_{j} - t)^{p}}{h^{p}}$$Where $Y(t_j)$ is the value of the time series at some point $j$.
$W_j(t)$ is a weight assigned to the the observation $Y(t_j)$ based on its distance from $t$.
$h$ is the bandwidth parameter that determines the size of the local neighbourhood.
$p$ is the degree of the local polynomial (commonly 1 or 2).
Weight Function
The weights $W_j(t)$ are computed using a weight function, commonly a tricube weight function defined as:
$$ W_{j}(t) = \begin{cases} (1 - |d|^3)^3 & \text{if } |d| < 1 \\ 0 & \text{if } |d| \geq 1 \end{cases} $$where $d = \frac{|t_{j} - t|}{h}$.
Iterative Fitting
LOWESS can also be implemented in an iterative manner, refining the fit by iterating through the residuals and re-weighting the observations based on their distance from the local fit.
import statsmodels.api as sm
# Create a new DataFrame meteo_data_numeric to avoid modifying the original meteo_data
daily_data_numeric = daily_dataframe_clean.copy()
# Remove rows with invalid 'date' values in the new DataFrame
daily_data_numeric = daily_data_numeric.dropna(subset=['date'])
# Convert 'date' to Unix timestamps in the new DataFrame
daily_data_numeric['date_numeric'] = daily_data_numeric['date'].apply(lambda x: x.timestamp())
# Setting the style of the seaborn plots
sns.set_style('whitegrid')
# Defining the variables to plot against "date_numeric"
variables_to_plot = daily_data_numeric.columns.drop(['date', 'date_numeric']).tolist()
# Define a color palette
color_palette = sns.color_palette("husl", len(variables_to_plot))
# Plotting each variable against 'date_numeric' with smoothing in the new DataFrame
for variable, color in zip(variables_to_plot, color_palette):
plt.figure(figsize=(12, 6))
# Plot the original data using Seaborn lineplot
sns.lineplot(data=daily_data_numeric, x='date_numeric', y=variable, color=color, label='Original')
# Apply LOWESS smoothing
smoothed = sm.nonparametric.lowess(daily_data_numeric[variable], daily_data_numeric['date_numeric'], frac=0.1)
# Plot the smoothed line
plt.plot(daily_data_numeric['date_numeric'], [point[1] for point in smoothed], color='red', linestyle='--', label='Smoothed')
# Add title, labels, and formatting
plt.title(f'Time Series Plot of {variable} with Smoothing', fontsize=14)
plt.xlabel('Date (Unix Timestamp)')
plt.ylabel(variable)
plt.xticks(rotation=45) # Rotating x-axis labels for better readability
plt.legend()
plt.tight_layout()
plt.show()
Augmented Dickey- Fuller (ADF) Test¶
From the Statsmodels package, stationarity conveys that the statistical properties of a time series, say, the mean, variance and covariance do not vary over time. Many statistical models require the series to be stationary to make effective and precise predictions. Two statistical tests would be used to check the stationarity of a time series - Augmented Dickey Fuller (“ADF”) Test and Kwiatkowski-Phillips-Schmidt-Shin (“KPSS”) Test.
Critical values are thresholds that determine whether the test statistic obtained from the ADF test is significant or not. The ADF test is commonly used to assess the stationarity of a time series data.
Here's how critical values work in this context:
Unit Root: A series with a unit root is a non-stationary time series, namely, it possesses a changining variance over time. Such a property makes the time series difficult toi analyze and model.
ADF Test Statistic: The ADF test calculates a test statistic based on the degree of non-stationarity in the time series data. This test statistic is compared against critical values to determine whether the data is stationary or non-stationary.
Null Hypothesis: The null hypothesis of the ADF test is that the time series data has a unit root, indicating it is non-stationary. The alternative hypothesis is that the data is stationary.
Critical Values: Critical values are pre-defined thresholds derived from statistical distributions, such as the Dickey-Fuller distribution. These critical values correspond to different levels of significance (e.g., 1%, 5%, 10%). They represent the values beyond which the ADF test statistic must exceed for the null hypothesis to be rejected.
If the ADF test statistic is more negative than the critical values, it provides evidence against the null hypothesis, suggesting stationarity in the data.
Conversely, if the ADF test statistic is less negative than the critical values, there's insufficient evidence to reject the null hypothesis, indicating non-stationarity in the data.
Interpretation: Typically, if the absolute value of the ADF test statistic is less negative than the critical values at a chosen significance level (e.g., 5%), then we fail to reject the null hypothesis, implying that the time series data is non-stationary. Conversely, if the absolute value of the test statistic is more negative than the critical values, we reject the null hypothesis, indicating stationarity in the data.
The ADF test is used to determine the presence of unit root in the series, and hence helps in understand if the series is stationary or not. The null and alternate hypothesis of this test are:
NULL HYPOTHESIS: The series has a unit root.
ALTERNATE HYPOTHESIS: The series has no unit root.
If the null hypothesis is failed to be rejected, this test may provide evidence that the series is non-stationary.
Autoregressive models are statistical models used for time series analysis, where present values are predicted based on a linear combination of past values. Such models assume that past behavior influences future outcomes, making them meaningful for forecasting trends and patterns in data over time (Fernando 2024). Test} ullr (ADF) test model is given by the following equa$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta y_{t-i} + \epsilon_t$$
Where:
$ y_t $ is the time series being tested,
$ \Delta y_t = y_t - y_{t-1} $ is the first difference of the time series,
$ t $ is the time trend (optional),
$ \alpha $ is a constant (drift term),
$ \beta t $ represents the deterministic time trend (optional),
$ \gamma $ is the coefficient for testing the presence of a unit root,
$ \delta_i $ are the coefficients for the lagged difference terms,
$ p $ is the number of lags of the differenced terms,
$ \epsilon_t $ is the white noise error term.
The hypotheses for the ADF test are as follows:
$ H_0: \gamma = 0 $ The series has a unit root, i.e., it is non-stationary;
$ H_A: \gamma < 0 $ The series is stationary
The test statistic is calculated using the $ t $-statistic of the estimated $ \gamma $:
$ \tau = \frac{\hat{\gamma}}{SE(\hat{\gamma})} $
Where:
$ \hat{\gamma} $ is the estimated coefficient for $ y_{t-1} $,
$ SE(\hat{\gamma}) $ is the standard error of $ \hat{\gamma} $.
If the test statistic $ \tau $ is more negative than the critical value, we reject the null hypothesis and conclude that the series is stationary.
If $ \tau $ is less negative than the critical value, we fail to reject the null hypothesis, implying the series has a unit root and is non-stationary.
from statsmodels.tsa.stattools import adfuller
# Initialize an empty list to store columns with p-values greater than 0.05
columns_with_high_p_values = []
# Loop through each column in the DataFrame
for column in daily_dataframe_clean.columns:
# Check if the column is constant
if daily_dataframe_clean[column].nunique() == 1:
print(f"Column '{column}' is constant and will be skipped.")
continue
# Performing ADF test on the current column.
result = adfuller(daily_dataframe_clean[column].dropna())
# Extracting ADF test results for the current column
print(f"ADF Test Results for '{column}':")
print(f"ADF Statistic: {result[0]}")
print(f"p-value: {result[1]}")
print(f"Critical Values: {result[4]}")
print("\n")
# Check if the p-value is greater than 0.05 (non-stationary)
if result[1] > 0.05:
columns_with_high_p_values.append(column)
# Create a data frame containing columns with p-values greater than 0.05 (non-stationary)
non_stationary_columns_df = daily_dataframe_clean[columns_with_high_p_values]
print("Columns with p-values greater than 0.05 (non-stationary):")
print(non_stationary_columns_df.head())
ADF Test Results for 'date':
ADF Statistic: 127.521177222858
p-value: 1.0
Critical Values: {'1%': -3.4307447795924704, '5%': -2.8617144767135985, '10%': -2.5668628695330438}
ADF Test Results for 'temperature_2m_mean':
ADF Statistic: -9.452164604817089
p-value: 4.585375734432722e-16
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}
ADF Test Results for 'temperature_2m_max':
ADF Statistic: -8.140604784014066
p-value: 1.030640432208805e-12
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}
ADF Test Results for 'temperature_2m_min':
ADF Statistic: -8.81654710030056
p-value: 1.9274119348856e-14
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}
ADF Test Results for 'apparent_temperature_mean':
ADF Statistic: -9.54169935641269
p-value: 2.715772031935339e-16
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}
ADF Test Results for 'apparent_temperature_max':
ADF Statistic: -9.331043783206056
p-value: 9.327040813917995e-16
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}
ADF Test Results for 'apparent_temperature_min':
ADF Statistic: -9.388649864148997
p-value: 6.652675031865217e-16
Critical Values: {'1%': -3.430744970348345, '5%': -2.861714561014421, '10%': -2.5668629144052084}
ADF Test Results for 'wind_speed_10m_max':
ADF Statistic: -19.007591367188702
p-value: 0.0
Critical Values: {'1%': -3.4307444938038882, '5%': -2.861714350414922, '10%': -2.566862802306002}
ADF Test Results for 'et0_fao_evapotranspiration':
ADF Statistic: -8.943862977329935
p-value: 9.099297004054091e-15
Critical Values: {'1%': -3.4307447795924704, '5%': -2.8617144767135985, '10%': -2.5668628695330438}
ADF Test Results for 'rain_sum':
ADF Statistic: -18.165064292851632
p-value: 2.4554926496016038e-30
Critical Values: {'1%': -3.4307445890207626, '5%': -2.8617143924941595, '10%': -2.5668628247041987}
ADF Test Results for 'dew_point_2m_max':
ADF Statistic: -9.386741780827915
p-value: 6.7275145125415855e-16
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}
ADF Test Results for 'dew_point_2m_min':
ADF Statistic: -9.155005530487388
p-value: 2.6242011536920892e-15
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}
ADF Test Results for 'surface_pressure_max':
ADF Statistic: -12.997314804753536
p-value: 2.732954497172046e-24
Critical Values: {'1%': -3.430744565212235, '5%': -2.861714381972446, '10%': -2.566862819103636}
ADF Test Results for 'surface_pressure_min':
ADF Statistic: -13.306442710596356
p-value: 6.88255466457685e-25
Critical Values: {'1%': -3.4307445176037983, '5%': -2.8617143609328277, '10%': -2.566862807904538}
ADF Test Results for 'pressure_msl_max':
ADF Statistic: -14.03454018893963
p-value: 3.387816589160927e-26
Critical Values: {'1%': -3.4307443986329553, '5%': -2.861714308355986, '10%': -2.566862779918612}
ADF Test Results for 'pressure_msl_min':
ADF Statistic: -12.990868429724106
p-value: 2.8143830251545577e-24
Critical Values: {'1%': -3.4307445176037983, '5%': -2.8617143609328277, '10%': -2.566862807904538}
ADF Test Results for 'relative_humidity_2m_max':
ADF Statistic: -10.04243834718041
p-value: 1.4840671835685065e-17
Critical Values: {'1%': -3.430744922642096, '5%': -2.861714539931579, '10%': -2.5668629031831025}
ADF Test Results for 'relative_humidity_2m_min':
ADF Statistic: -6.335038407097828
p-value: 2.844371020791204e-08
Critical Values: {'1%': -3.430744898793293, '5%': -2.861714529392068, '10%': -2.566862897573066}
ADF Test Results for 'wet_bulb_temperature_2m_max':
ADF Statistic: -10.193553778660885
p-value: 6.230936312873574e-18
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}
ADF Test Results for 'wet_bulb_temperature_2m_min':
ADF Statistic: -9.75580665328518
p-value: 7.795623908912306e-17
Critical Values: {'1%': -3.4307449942057917, '5%': -2.861714571557752, '10%': -2.5668629200172783}
ADF Test Results for 'vapour_pressure_deficit_max':
ADF Statistic: -4.754890939104802
p-value: 6.628746720590535e-05
Critical Values: {'1%': -3.43074494649378, '5%': -2.8617145504723633, '10%': -2.5668629087938166}
ADF Test Results for 'soil_temperature_0_to_7cm_mean':
ADF Statistic: -6.294926686120316
p-value: 3.524903027117927e-08
Critical Values: {'1%': -3.430744970348345, '5%': -2.861714561014421, '10%': -2.5668629144052084}
Columns with p-values greater than 0.05 (non-stationary):
date
0 1980-01-08 04:00:00+00:00
1 1980-01-09 04:00:00+00:00
2 1980-01-10 04:00:00+00:00
3 1980-01-11 04:00:00+00:00
4 1980-01-12 04:00:00+00:00
From the above results, excluding the 'date' index, there are no attributes with non-stationarity. Hence, co-integration analysis among attribute pairs are not possible. Co-integration concerns observing the long-term trend between two variables, to identify any possible similar behaviour. For the case of lacking non-stationarity, one needs to revert to measures like correlation.
Long-Term Forecasting¶
When it comes to long-term forecasting, there are several approaches and techniques one can apply that doesn't explicitly require the data to be stationary. Some examples:
- Facebook's Prophet: This is a powerful forecasting tool that can handle missing data and outliers. It works well with daily data and captures seasonality without needing to transform the data to be stationary.
- Multiple Linear Regression: You can use regression techniques to forecast future values based on one or more predictor variables without the need for stationarity. This approach works well when you have external factors that influence your target variable.
Forecasting with Prophet¶
Prophet is an open-source forecasting tool developed by Facebook, designed specifically for making forecasts with time series data.
KEY FEATURES OF PROPHET:
- Automatic Seasonal Adjustment:
Prophet automatically detects and accounts for yearly, weekly, and daily seasonal effects in the data. This is especially useful for datasets that show clear periodic trends.
- Flexible Trend Modeling:
Prophet can model trends that change over time, including linear and logistic growth models. This allows it to adapt to both consistent growth and more complex trend behaviors.
- Handling of Missing Data:
Prophet is robust to missing data points and can perform well even if some timestamps are missing.
- User-Friendly:
Designed to be easy to use for both novices and experienced data scientists, it requires minimal preprocessing of the data.
- Outlier Detection:
The model can identify and handle outliers, which can significantly impact forecast accuracy.
- Incorporation of Holidays:
Users can include custom holidays and special events, allowing the model to account for effects that might not be captured by the seasonal trends alone.
- Scalability:
Prophet is efficient for large datasets and can quickly fit models and generate forecasts.
MATHEMATICAL STRUCTURE OF PROPHET
Prophet decomposes a time series into three main components:
Trend Component:
Represents the long-term progression of the series.
Can be modeled as a piecewise linear or logistic growth curve. The algorithm automatically detects changes in the trend (change points).
$$g(t)=\text {piecewise linear or logistic growth function}$$Seasonal Component:
Captures periodic fluctuations in the data, which can occur yearly, weekly, or daily.
Seasonal effects are modeled using Fourier series. The number of Fourier terms can be adjusted for each seasonality.
$$s(t) = \sum_{n=1}^{N} \left[ a_n \cos\left( \frac{2 \pi n t}{T} \right) + b_n \sin\left( \frac{2 \pi n t}{T} \right) \right]$$Holiday Effects:
Incorporates the effects of holidays that can cause significant changes in the time series.
The holiday effect can be treated as an additional regressor in the model.
$$h(t) = \sum_{i=1}^{H} \delta_i \cdot I(t \in \text{holiday}_i)$$$H$ is the number of holidays.
$\delta_i$ is the effect of holiday $i$.
$I(t \in \text{holiday}_i)$ is an indicator function that is 1 if $t$ falls on holiday $i$.
Overall model represented by:
$$y(t)=g(t)+s(t)+h(t)+\epsilon_t$$$\epsilon_t$ is the error term, assumed to be normally distributed.
from prophet import Prophet
# Ensure 'date' is not set as the index
if 'date' not in daily_dataframe_clean.columns:
meteo_data.reset_index(inplace=True)
# List of variables to forecast (excluding 'date')
variables_to_forecast = daily_dataframe_clean.columns.drop('date')
# Create a function to fit the model and make predictions
def forecast_variable(variable):
# Prepare data
forecast_data = daily_dataframe_clean[['date', variable]].rename(columns={'date': 'ds', variable: 'y'})
# Remove timezone information if present
forecast_data['ds'] = forecast_data['ds'].dt.tz_localize(None) # Remove timezone
# Initialize the Prophet model
model = Prophet()
# Fit the model
model.fit(forecast_data)
# Make future predictions for the next 365 days
future = model.make_future_dataframe(periods=365)
forecast = model.predict(future)
# Plot the forecast
fig = model.plot(forecast)
plt.title(f'Long-term Forecast for {variable}')
plt.xlabel('Date')
plt.ylabel(variable)
plt.show()
return forecast
# Loop through each variable and forecast
forecasts = {}
for variable in variables_to_forecast:
forecasts[variable] = forecast_variable(variable)
22:28:54 - cmdstanpy - INFO - Chain [1] start processing 22:29:14 - cmdstanpy - INFO - Chain [1] done processing
22:29:19 - cmdstanpy - INFO - Chain [1] start processing 22:29:35 - cmdstanpy - INFO - Chain [1] done processing
22:29:40 - cmdstanpy - INFO - Chain [1] start processing 22:29:54 - cmdstanpy - INFO - Chain [1] done processing
22:29:58 - cmdstanpy - INFO - Chain [1] start processing 22:30:13 - cmdstanpy - INFO - Chain [1] done processing
22:30:17 - cmdstanpy - INFO - Chain [1] start processing 22:30:30 - cmdstanpy - INFO - Chain [1] done processing
22:30:35 - cmdstanpy - INFO - Chain [1] start processing 22:30:50 - cmdstanpy - INFO - Chain [1] done processing
22:30:55 - cmdstanpy - INFO - Chain [1] start processing 22:31:01 - cmdstanpy - INFO - Chain [1] done processing
22:31:06 - cmdstanpy - INFO - Chain [1] start processing 22:31:11 - cmdstanpy - INFO - Chain [1] done processing
22:31:16 - cmdstanpy - INFO - Chain [1] start processing 22:31:21 - cmdstanpy - INFO - Chain [1] done processing
22:31:28 - cmdstanpy - INFO - Chain [1] start processing 22:31:43 - cmdstanpy - INFO - Chain [1] done processing
22:31:47 - cmdstanpy - INFO - Chain [1] start processing 22:32:01 - cmdstanpy - INFO - Chain [1] done processing
22:32:06 - cmdstanpy - INFO - Chain [1] start processing 22:32:18 - cmdstanpy - INFO - Chain [1] done processing
22:32:23 - cmdstanpy - INFO - Chain [1] start processing 22:32:33 - cmdstanpy - INFO - Chain [1] done processing
22:32:38 - cmdstanpy - INFO - Chain [1] start processing 22:32:50 - cmdstanpy - INFO - Chain [1] done processing
22:32:55 - cmdstanpy - INFO - Chain [1] start processing 22:33:04 - cmdstanpy - INFO - Chain [1] done processing
22:33:12 - cmdstanpy - INFO - Chain [1] start processing 22:33:20 - cmdstanpy - INFO - Chain [1] done processing
22:33:24 - cmdstanpy - INFO - Chain [1] start processing 22:33:36 - cmdstanpy - INFO - Chain [1] done processing
22:33:42 - cmdstanpy - INFO - Chain [1] start processing 22:33:56 - cmdstanpy - INFO - Chain [1] done processing
22:34:01 - cmdstanpy - INFO - Chain [1] start processing 22:34:14 - cmdstanpy - INFO - Chain [1] done processing
22:34:19 - cmdstanpy - INFO - Chain [1] start processing 22:34:39 - cmdstanpy - INFO - Chain [1] done processing
22:34:46 - cmdstanpy - INFO - Chain [1] start processing 22:35:02 - cmdstanpy - INFO - Chain [1] done processing
Multilinear Regression¶
Such is a statistical method that explores the relationship between a dependent variable (target) and two or more independent variables (features or predictors) by fitting a linear equation to observed data. It aims to understand how changes in the independent variables collectively influence the dependent variable.
The values that multiply the predictors in the regression equation. These coefficients indicate the strength and direction of the relationship between each predictor and the target, holding other variables constant.
Multiple linear regression finds the best-fitting linear equation that minimizes the differences between the observed values of the dependent variable and the values predicted by the equation. This is typically done by minimizing the sum of squared errors between the predicted and actual values.
Recalling all non-time attributes to be float types, hence multilinear regression is applicable:
daily_data_sans_first_col.info()
<class 'pandas.core.frame.DataFrame'> Index: 16603 entries, 0 to 16602 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 temperature_2m_mean 16603 non-null float32 1 temperature_2m_max 16603 non-null float32 2 temperature_2m_min 16603 non-null float32 3 apparent_temperature_mean 16603 non-null float32 4 apparent_temperature_max 16603 non-null float32 5 apparent_temperature_min 16603 non-null float32 6 wind_speed_10m_max 16603 non-null float32 7 et0_fao_evapotranspiration 16603 non-null float32 8 rain_sum 16603 non-null float32 9 dew_point_2m_max 16603 non-null float32 10 dew_point_2m_min 16603 non-null float32 11 surface_pressure_max 16603 non-null float32 12 surface_pressure_min 16603 non-null float32 13 pressure_msl_max 16603 non-null float32 14 pressure_msl_min 16603 non-null float32 15 relative_humidity_2m_max 16603 non-null float32 16 relative_humidity_2m_min 16603 non-null float32 17 wet_bulb_temperature_2m_max 16603 non-null float32 18 wet_bulb_temperature_2m_min 16603 non-null float32 19 vapour_pressure_deficit_max 16603 non-null float32 20 soil_temperature_0_to_7cm_mean 16603 non-null float32 dtypes: float32(21) memory usage: 1.5 MB
Recall the Pearson Correlation Heatmap:
# Applying Pearson correlation to the data set
daily_pearson_corr = daily_dataframe_clean.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(daily_pearson_corr, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation heatmap of Montserrat Daily Meteorological Data')
plt.savefig('daily_heatmap.pdf', format = 'pdf')
plt.show()
Quantile Regression¶
One can use regression techniques to forecast future values based on one or more predictor variables without the need for stationarity. This approach works well when you have external factors that influence your target variable.
Firstly, observing the scatter plots and Pearson correlation heatmap there's evidence for high nonlinearity among variable pairs. Then again, linear regression is not the only type of regression.
Quantile Regression can effectively manage nonlinearity in relationships between variables. Unlike ordinary least squares (OLS) regression, which estimates the conditional mean of the response variable given certain predictor variables, quantile regression estimates the conditional quantiles (e.g., median, quartiles) of the response variable. This allows it to provide a more comprehensive view of the relationship between variables, particularly in the presence of nonlinearity.
Basic Quantile Regression Model:
The mathematical formulation of quantile regression (Koenker and Hallock 2001) can be defined as follows:
- Model Specification – quantile regression model for a given quantile $\tau$ where $(0 < \tau < 1)$ can be expressed in the manner of
$Q_y(\tau|X)$ is the $\tau$-quantile of the response variable (target) $y$ given the predictor variables (features) $X$.
$X$ is a vector of predictor variables.
$\beta(\tau)$ is the vector of coefficients associated with the quantile $\tau$.
- Objective Function – the estimation of the quantile regression coefficients $\beta(\tau)$ is done by minimizing the following loss function:
$n$ being the number of observation,
$\rho_{\tau}(u)$ as the quantile loss function, define by:
$$\rho_\tau(u) = \begin{cases} \tau u & \text{if } u \geq 0 \\(\tau - 1) u & \text{if } u < 0 \end{cases}$$Such above function provides different penalties for positive and negative residuals based on the quantiles of interest.
Feature Selection¶
This process concerns identifying the features that are most influential upon the target of concern. In this project Random Forest regressor is applied. Random Forest is an ensemble learning method, commonly applying decision trees.
Random Forest Feature Selection:
There is the challenge to identify attributes (or features or predictors) that influence a target variable without (cognitive) bias. Feature selection techniques can be applied for determining the importance or relevance of features in predictive modeling.
The random forest (regressor) will be applied for feature selection. Firstly, Random Forest is a popular ensemble learning algorithm that combines the predictions of multiple decision trees to improve accuracy and reduce overfitting. It's a versatile method used for both classification and regression tasks. The algorithm creates multiple decision trees by randomly sampling the training data with replacement. This process is known as bootstrapping (in a bagging sense). Each tree is trained on a different subset of the data. For each decision node in a tree, a random subset of features is selected. This helps to prevent overfitting by reducing the correlation between trees. Once all trees are trained, their predictions are combined to make a final decision. For classification tasks, a majority vote is used. For regression tasks, the average of the predictions is taken.
One of the key strengths of Random Forest is its ability to reduce overfitting. By creating multiple decision trees and averaging their predictions, the algorithm effectively mitigates the risk of any individual tree becoming overly specialized to the training data. This ensemble approach helps to generalize the model and improve its performance on unseen data.
Moreover, Random Forest consistently outperforms individual decision trees, especially when dealing with complex datasets. The combination of multiple diverse models leads to a more accurate and robust prediction.
Another valuable aspect of Random Forest is its capability to assess feature importance. By analyzing the frequency with which features are selected in the decision trees, the algorithm can provide insights into which variables are most influential in the prediction process. This information is invaluable for understanding the underlying relationships and making informed decisions about feature selection or engineering.
Random Forest is also known for its robustness to noise and outliers. The ensemble nature of the algorithm helps to reduce the impact of individual noisy data points, making it more resilient to variations in the data.
Furthermore, Random Forest is highly scalable, capable of handling large datasets and high-dimensional feature spaces efficiently. This scalability makes it suitable for a wide range of applications. Such includes geophysics, bioological sciences, medical diagnosis, financial forecasting, and sports.
Visuals for Random Forest Regressor:
- Prediction Line Visualization:
Shows how the Random Forest fits the data, highlighting the average behavior of multiple decision trees.
- Tree Structure Visualization:
Displays the structure of an individual decision tree used in the regression to understand the splits.
- Feature Importance Plot:
Shows the importance of each feature in the Random Forest regression model.
An example visualization of how a Random Forest functions (for one and multiple features):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import plot_tree
import seaborn as sns
# Generate synthetic regression data
np.random.seed(42)
X = np.random.rand(100, 1) * 10 # 100 data points, one feature
y = 2 * np.sin(X).ravel() + np.random.normal(0, 0.5, X.shape[0])
# Train Random Forest Regressor
regressor = RandomForestRegressor(n_estimators=15, random_state=42)
regressor.fit(X, y)
# 1. Prediction Line Visualization
plt.figure(figsize=(10, 6))
plt.scatter(X, y, color="blue", label="Data")
X_test = np.linspace(0, 10, 500).reshape(-1, 1)
y_pred = regressor.predict(X_test)
plt.plot(X_test, y_pred, color="red", label="Random Forest Prediction")
plt.title('Random Forest Regressor - Prediction Line')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
# Add space between plots
plt.subplots_adjust(hspace=0.4)
# 2. Tree Structure Visualization
plt.figure(figsize=(20, 10))
plot_tree(regressor.estimators_[0], filled=True, rounded=True, feature_names=['Feature 1'])
plt.title("Random Forest Regressor - Tree 1")
plt.show()
# Add space between plots
plt.subplots_adjust(hspace=0.4)
# 3. Feature Importance Plot
feature_importances = regressor.feature_importances_
features = ['Feature 1']
plt.figure(figsize=(8, 6))
sns.barplot(x=features, y=feature_importances)
plt.title('Feature Importances in Random Forest Regressor')
plt.show()
<Figure size 640x480 with 0 Axes>
<Figure size 640x480 with 0 Axes>
Explanation of the Visuals
- Prediction Line Visualization:
This plot shows how the Random Forest Regressor fits the data by averaging the outputs of individual decision trees. The red line represents the predicted values, while the blue points are the actual data points.
- Tree Structure Visualization:
This uses the plot_tree function to visualize an individual tree from the Random Forest, showing the splits and conditions used for regression.
- Feature Importance Plot:
Displays the importance of each feature in the Random Forest model, indicating how much each feature contributes to the predictions.
For the above display, the feature importance value is extreme and in favor of the feature because the target or dependent variable is structured upon the feature. Concerning the scatter plot with the random forest prediction, the random forest prediction will "converge" to the orientation of the data due to the direct construction of the target from the feature.
Multiple Features:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import plot_tree
import seaborn as sns
# Generate synthetic regression data with multiple features
np.random.seed(42)
X = np.random.rand(100, 5) * 10 # 100 data points, five features
y = (
2 * np.sin(X[:, 0]) +
3 * np.cos(X[:, 1]) +
1.5 * X[:, 2] +
0.5 * X[:, 3] ** 2 +
np.random.normal(0, 0.5, X.shape[0]) # Adding noise
)
# Train Random Forest Regressor
regressor = RandomForestRegressor(n_estimators=20, random_state=42)
regressor.fit(X, y)
# 1. Prediction Line Visualization (using the first feature)
plt.figure(figsize=(10, 6))
X_test = np.linspace(0, 10, 500).reshape(-1, 1)
y_pred = regressor.predict(np.concatenate([X_test, np.zeros((500, 4))], axis=1)) # Keeping other features constant
plt.scatter(X[:, 0], y, color="blue", label="Data")
plt.plot(X_test, y_pred, color="red", label="Random Forest Prediction")
plt.title('Random Forest Regressor - Prediction Line (First Feature)')
plt.xlabel('Feature 1')
plt.ylabel('Target')
plt.legend()
plt.show()
# 2. Tree Structure Visualization (showing the first tree)
plt.figure(figsize=(20, 10))
plot_tree(regressor.estimators_[0], filled=True, rounded=True, feature_names=[f'Feature {i+1}' for i in range(X.shape[1])])
plt.title("Random Forest Regressor - Tree 1")
plt.show()
# 3. Feature Importance Plot
feature_importances = regressor.feature_importances_
features = [f'Feature {i+1}' for i in range(X.shape[1])] # Generate feature names dynamically
plt.figure(figsize=(10, 6))
# Assign features to hue and set legend to False
sns.barplot(x=features, y=feature_importances, palette='viridis', hue=features)
plt.title('Feature Importances in Random Forest Regressor')
plt.ylabel('Importance Score')
plt.xlabel('Features')
plt.xticks(rotation=45) # Rotate feature names for better readability
plt.legend([],[], frameon=False) # Remove legend
plt.show()
As for a target based on multile features, observing the scatter plot, the prediction curve generally should not converge to the orientation of the single first feature because the other features have influence upon the determination of the target. The prediction line concerning the first feature conveys barely any relationship with it. Observing the model, the quadratic term has the dominant influence in the long run.
Now, going back to the real data to implement.
For each target variable (like rain_sum, dew_point_2m_min, etc.):
Use Recursive Feature Elimination (RFE) with a Random Forest Regressor to select the top 5 most important features.
Use those selected features to fit a quantile regression model (median regression, i.e., quantile = 0.5) to understand the effect of those features on the target.
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor
# List of target variables
targets = ['rain_sum', 'dew_point_2m_min',
'dew_point_2m_max',
'et0_fao_evapotranspiration',
'soil_temperature_0_to_7cm_mean',
'wet_bulb_temperature_2m_min',
'wet_bulb_temperature_2m_max',
'soil_temperature_0_to_7cm_mean']
# Initialize a dictionary to store selected features for each target
selected_features_dict = {}
# Iterate over each target variable
for target in targets:
# Separate independent and dependent variables
X = daily_data_sans_first_col.drop(target, axis=1) # Drop the target column from the features
y = daily_data_sans_first_col[target] # Set the target column
# Initialize a RandomForestRegressor
rf = RandomForestRegressor(n_jobs=-1, max_depth=5)
# Initialize RFE with the desired number of features
rfe = RFE(estimator=rf, n_features_to_select=5)
# Fit RFE
rfe.fit(X, y)
# Get the selected features for the current target
selected_features = rfe.support_
important_features = X.columns[selected_features].tolist()
# Store the selected features in the dictionary
selected_features_dict[target] = important_features
# Print the selected features for the current target
print(f"Selected Features with RFE for {target}: {important_features}")
# Now, for each target, perform quantile regression using the selected features
for target, selected_features in selected_features_dict.items():
# Prepare the independent variables (selected features)
X_selected = daily_data_sans_first_col[selected_features]
# Add a constant (intercept term)
X_selected = sm.add_constant(X_selected)
# Dependent variable (response)
y = daily_data_sans_first_col[target]
# Fit the quantile regression model at the 0.5 quantile
model = sm.QuantReg(y, X_selected)
quantile_50 = model.fit(q=0.5)
# Print the summary of the quantile regression for the current target
print(f"Quantile Regression Summary for {target}:")
print(quantile_50.summary())
print("\n" + "="*80 + "\n")
Selected Features with RFE for rain_sum: ['wind_speed_10m_max', 'et0_fao_evapotranspiration', 'surface_pressure_min', 'relative_humidity_2m_max', 'wet_bulb_temperature_2m_max']
Selected Features with RFE for dew_point_2m_min: ['rain_sum', 'relative_humidity_2m_min', 'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean']
Selected Features with RFE for dew_point_2m_max: ['temperature_2m_max', 'relative_humidity_2m_max', 'wet_bulb_temperature_2m_max', 'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean']
Selected Features with RFE for et0_fao_evapotranspiration: ['temperature_2m_mean', 'apparent_temperature_max', 'wind_speed_10m_max', 'rain_sum', 'vapour_pressure_deficit_max']
Selected Features with RFE for soil_temperature_0_to_7cm_mean: ['temperature_2m_max', 'apparent_temperature_max', 'et0_fao_evapotranspiration', 'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max']
Selected Features with RFE for wet_bulb_temperature_2m_min: ['temperature_2m_mean', 'temperature_2m_max', 'temperature_2m_min', 'dew_point_2m_min', 'wet_bulb_temperature_2m_max']
Selected Features with RFE for wet_bulb_temperature_2m_max: ['temperature_2m_mean', 'temperature_2m_max', 'dew_point_2m_max', 'wet_bulb_temperature_2m_min', 'soil_temperature_0_to_7cm_mean']
Selected Features with RFE for soil_temperature_0_to_7cm_mean: ['temperature_2m_max', 'apparent_temperature_max', 'et0_fao_evapotranspiration', 'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max']
Quantile Regression Summary for rain_sum:
QuantReg Regression Results
==============================================================================
Dep. Variable: rain_sum Pseudo R-squared: 0.1378
Model: QuantReg Bandwidth: 0.2189
Method: Least Squares Sparsity: 2.360
Date: Fri, 27 Jun 2025 No. Observations: 16603
Time: 22:43:22 Df Residuals: 16597
Df Model: 5
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
const 123.4087 5.599 22.043 0.000 112.435 134.383
wind_speed_10m_max 0.0444 0.002 29.388 0.000 0.041 0.047
et0_fao_evapotranspiration -0.5483 0.017 -32.035 0.000 -0.582 -0.515
surface_pressure_min -0.1327 0.006 -23.109 0.000 -0.144 -0.121
relative_humidity_2m_max 0.0922 0.003 32.625 0.000 0.087 0.098
wet_bulb_temperature_2m_max 0.0358 0.009 3.853 0.000 0.018 0.054
===============================================================================================
The condition number is large, 6.01e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
================================================================================
Quantile Regression Summary for dew_point_2m_min:
QuantReg Regression Results
==============================================================================
Dep. Variable: dew_point_2m_min Pseudo R-squared: 0.9204
Model: QuantReg Bandwidth: 0.01984
Method: Least Squares Sparsity: 0.2287
Date: Fri, 27 Jun 2025 No. Observations: 16603
Time: 22:43:22 Df Residuals: 16597
Df Model: 5
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -18.5247 0.079 -235.023 0.000 -18.679 -18.370
rain_sum 0.0047 0.000 24.119 0.000 0.004 0.005
relative_humidity_2m_min 0.2377 0.001 180.456 0.000 0.235 0.240
wet_bulb_temperature_2m_min 0.7980 0.002 444.523 0.000 0.794 0.802
vapour_pressure_deficit_max 3.9190 0.035 111.079 0.000 3.850 3.988
soil_temperature_0_to_7cm_mean 0.0241 0.001 17.985 0.000 0.021 0.027
==================================================================================================
The condition number is large, 7.78e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
================================================================================
Quantile Regression Summary for dew_point_2m_max:
QuantReg Regression Results
==============================================================================
Dep. Variable: dew_point_2m_max Pseudo R-squared: 0.9058
Model: QuantReg Bandwidth: 0.02402
Method: Least Squares Sparsity: 0.3092
Date: Fri, 27 Jun 2025 No. Observations: 16603
Time: 22:43:22 Df Residuals: 16597
Df Model: 5
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const -4.1291 0.032 -128.618 0.000 -4.192 -4.066
temperature_2m_max -0.1994 0.005 -39.337 0.000 -0.209 -0.189
relative_humidity_2m_max 0.0462 0.000 132.738 0.000 0.046 0.047
wet_bulb_temperature_2m_max 1.2237 0.004 291.276 0.000 1.215 1.232
vapour_pressure_deficit_max 0.4028 0.015 26.619 0.000 0.373 0.432
soil_temperature_0_to_7cm_mean -0.0359 0.002 -20.659 0.000 -0.039 -0.033
==================================================================================================
The condition number is large, 2.56e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
================================================================================
Quantile Regression Summary for et0_fao_evapotranspiration:
QuantReg Regression Results
======================================================================================
Dep. Variable: et0_fao_evapotranspiration Pseudo R-squared: 0.3770
Model: QuantReg Bandwidth: 0.08506
Method: Least Squares Sparsity: 1.080
Date: Fri, 27 Jun 2025 No. Observations: 16603
Time: 22:43:22 Df Residuals: 16597
Df Model: 5
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
const 1.0758 0.109 9.836 0.000 0.861 1.290
temperature_2m_mean -0.1625 0.010 -16.569 0.000 -0.182 -0.143
apparent_temperature_max 0.1827 0.005 36.266 0.000 0.173 0.193
wind_speed_10m_max 0.0474 0.001 44.400 0.000 0.045 0.049
rain_sum -0.1016 0.001 -110.055 0.000 -0.103 -0.100
vapour_pressure_deficit_max 1.4703 0.018 80.902 0.000 1.435 1.506
===============================================================================================
The condition number is large, 1.24e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
================================================================================
Quantile Regression Summary for soil_temperature_0_to_7cm_mean:
QuantReg Regression Results
==========================================================================================
Dep. Variable: soil_temperature_0_to_7cm_mean Pseudo R-squared: 0.6308
Model: QuantReg Bandwidth: 0.08111
Method: Least Squares Sparsity: 1.037
Date: Fri, 27 Jun 2025 No. Observations: 16603
Time: 22:43:22 Df Residuals: 16597
Df Model: 5
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
const 3.2281 0.099 32.453 0.000 3.033 3.423
temperature_2m_max 0.9753 0.014 70.922 0.000 0.948 1.002
apparent_temperature_max 0.0731 0.003 23.266 0.000 0.067 0.079
et0_fao_evapotranspiration -0.3908 0.007 -54.470 0.000 -0.405 -0.377
wet_bulb_temperature_2m_min -0.1405 0.011 -12.357 0.000 -0.163 -0.118
vapour_pressure_deficit_max 1.2173 0.061 20.052 0.000 1.098 1.336
===============================================================================================
The condition number is large, 1.11e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
================================================================================
Quantile Regression Summary for wet_bulb_temperature_2m_min:
QuantReg Regression Results
=======================================================================================
Dep. Variable: wet_bulb_temperature_2m_min Pseudo R-squared: 0.8959
Model: QuantReg Bandwidth: 0.02574
Method: Least Squares Sparsity: 0.3332
Date: Fri, 27 Jun 2025 No. Observations: 16603
Time: 22:43:23 Df Residuals: 16597
Df Model: 5
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
const 0.5340 0.029 18.250 0.000 0.477 0.591
temperature_2m_mean 0.3632 0.009 41.830 0.000 0.346 0.380
temperature_2m_max -0.0747 0.004 -17.406 0.000 -0.083 -0.066
temperature_2m_min 0.0433 0.004 9.748 0.000 0.035 0.052
dew_point_2m_min 0.5662 0.002 279.097 0.000 0.562 0.570
wet_bulb_temperature_2m_max 0.0638 0.004 15.458 0.000 0.056 0.072
===============================================================================================
The condition number is large, 1.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
================================================================================
Quantile Regression Summary for wet_bulb_temperature_2m_max:
QuantReg Regression Results
=======================================================================================
Dep. Variable: wet_bulb_temperature_2m_max Pseudo R-squared: 0.9010
Model: QuantReg Bandwidth: 0.02272
Method: Least Squares Sparsity: 0.2932
Date: Fri, 27 Jun 2025 No. Observations: 16603
Time: 22:43:23 Df Residuals: 16597
Df Model: 5
==================================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------------------------
const 0.3041 0.025 12.066 0.000 0.255 0.354
temperature_2m_mean 0.2068 0.004 51.538 0.000 0.199 0.215
temperature_2m_max 0.0914 0.004 22.976 0.000 0.084 0.099
dew_point_2m_max 0.6201 0.002 253.660 0.000 0.615 0.625
wet_bulb_temperature_2m_min 0.0606 0.003 20.632 0.000 0.055 0.066
soil_temperature_0_to_7cm_mean -0.0002 0.002 -0.138 0.890 -0.004 0.003
==================================================================================================
The condition number is large, 1.17e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
================================================================================
Interpretation of the Summary Statistics
Number of Observations: value indicates that there are such number of data points used in this regression.
Df Residuals: value represents the degrees of freedom for the residuals (observations - model parameters).
Df Model: value indicates the number of predictors (excluding the constant).
P-values: If the p-values (for the intercept, feature1, feature2,…, featureN) are less than 0.05, there’s indication that the coefficients are statistically significant at a 5% significance level.
Confidence Intervals: The intervals for each coefficient (e.g., [0.025, 0.975]) provide a range within which we can be 95% confident that the true parameter value lies.
Pseudo R-squared: for the value observed, convert to percentage. such measure gives indication that determined percentage accounts for the variability in the target that is explained by the model. Although it's not the same as R-squared in OLS regression, a higher value suggests a good fit
In the context of quantile regression, one of the most commonly used pseudo R² metrics is Koenker and Machado’s pseudo R². This is a specific form of pseudo R² that was developed to assess the fit of quantile regression models.
Koenker and Machado's Pseudo R²: designed to assess how well the model explains the variability in the data, relative to a baseline (often the model that predicts the median).
The Koenker and Machado pseudo R² is defined as:
$$R^2 = 1 - \frac{V(\hat{\theta})}{V(\hat{\theta}_0)}$$$V(\hat{\theta})$ is the sum of weighted absolute residuals for the fitted model.
$V(\hat{\theta}_0)$ is the sum of weighted absolute residuals for the baseline model (usually a model predicting the median).
Interpretation: It compares the sum of residuals of your model to the residuals of a simpler baseline model. If your model performs better than the baseline, the pseudo $R^2$ will be positive, and if it performs worse, it can be negative.
With daily meteorological/climate data forecasting the associated attributes for long-term predictions presents several challenges that make it less practical. Here are some key reasons:
- High Variability:
A respective attribute can be highly variable, influenced by many factors like weather patterns, geographical location, and seasonality. This variability makes it difficult to produce reliable long-term forecasts.
- Inherent Noise:
Daily weather data is often noisy, with short-term fluctuations that can overshadow longer-term trends. This noise can complicate the forecasting process, as models may struggle to discern meaningful patterns.
- Seasonal Patterns:
Attributes like rainfall tends to exhibit seasonal patterns (e.g., wet and dry seasons), which might require different modeling approaches for different times of the year. Long-term forecasts may fail to capture these nuances.
- External Influences:
Long-term rainfall trends can be influenced by climate change, urbanization, and other large-scale environmental changes. These factors may not be adequately represented in historical data used for forecasting.
- Data Limitations:
Daily (rain_sum) data may be limited in historical depth or spatial coverage, especially in areas with fewer weather stations. This can affect the quality of long-term forecasts.
- Non-stationarity:
Climate patterns and (rainfall) distributions can change over time due to climate change and other factors, leading to non-stationarity in the data. This poses challenges for traditional forecasting models, which often assume that historical patterns will persist.
- Forecasting Horizon:
Long-term forecasts (e.g., months or years ahead) may be more appropriate for aggregate measures (like monthly or annual rainfall) rather than daily values. The uncertainty associated with daily forecasts increases significantly over longer time horizons.
- Practical Applications:
Many practical applications (like agriculture or water resource management) might benefit more from monthly or seasonal rainfall forecasts rather than daily forecasts. Longer aggregation periods can provide more relevant information for decision-making.
Instead of using daily attribute measures, consider forecasting monthly or seasonal totals, which may better capture underlying trends and patterns while mitigating some of the issues mentioned above. This could improve the reliability and applicability of your forecasts in various contexts.
The majority of the identified daily meteorological variables don't capture atmospheric physics or atmospheric chemistry to generally (analytically) model precipitation. Such data set is best suited for time series analysis, modeling and forecasting long term to analyze drastic climate variations.
The prior summary statistics for each model convey strong multicollinearity issues. Again, reviewing the correlation heatmap matrix.
# Applying Pearson correlation to the data set
daily_pearson_corr = daily_dataframe_clean.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(daily_pearson_corr, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation heatmap of Montserrat Daily Meteorological Data')
plt.savefig('daily_heatmap.pdf', format = 'pdf')
plt.show()
Observed above are some observations of high correlation among pairs; exclude the diagonal elements due to the obvious.
A potential resolution for such multicollinearity is to observe the feature importance/rank w.r.t. the target in question. After observing the level of importance or rank, to then observe the correlation heat map. For a high positive correlation pair (where the idea of "high" is subjective), to choose the feature of higher importance.
from sklearn.model_selection import train_test_split
# List of targets
targets = [
'rain_sum', 'dew_point_2m_min', 'dew_point_2m_max',
'et0_fao_evapotranspiration', 'soil_temperature_0_to_7cm_mean',
'wet_bulb_temperature_2m_min', 'wet_bulb_temperature_2m_max'
]
# Loop through each target
for target in targets:
print(f"\n{'='*60}\nAnalyzing Target: {target}\n{'='*60}")
# Define features: drop current target from targets list + use all other columns
possible_features = daily_data_sans_first_col.drop(columns=[target])
# Split data
X_train, X_test, y_train, y_test = train_test_split(
possible_features,
daily_data_sans_first_col[target],
test_size=0.2,
random_state=42
)
# Initialize model
rf_model = RandomForestRegressor(n_estimators=50, random_state=42)
# Fit model
rf_model.fit(X_train, y_train)
# Feature importances
importances = rf_model.feature_importances_
feature_importances = pd.DataFrame({
'Feature': X_train.columns,
'Importance': importances
}).sort_values(by='Importance', ascending=False)
# Plot feature importances
plt.figure(figsize=(12, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.title(f'Feature Importances for Target: {target}')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
# Print ranked features
print("Ranked Features based on Importance:")
print(feature_importances)
# Recursive Feature Elimination
rfe = RFE(estimator=rf_model, n_features_to_select=5)
rfe.fit(X_train, y_train)
selected_features = X_train.columns[rfe.support_]
print("Selected Features by RFE:")
print(selected_features.tolist())
============================================================ Analyzing Target: rain_sum ============================================================
Ranked Features based on Importance:
Feature Importance
7 et0_fao_evapotranspiration 0.489625
14 relative_humidity_2m_max 0.137958
6 wind_speed_10m_max 0.102763
15 relative_humidity_2m_min 0.027880
19 soil_temperature_0_to_7cm_mean 0.025270
5 apparent_temperature_min 0.023284
16 wet_bulb_temperature_2m_max 0.020404
8 dew_point_2m_max 0.018505
13 pressure_msl_min 0.018271
1 temperature_2m_max 0.016599
18 vapour_pressure_deficit_max 0.016413
11 surface_pressure_min 0.015052
0 temperature_2m_mean 0.013708
2 temperature_2m_min 0.013610
9 dew_point_2m_min 0.012642
4 apparent_temperature_max 0.011980
17 wet_bulb_temperature_2m_min 0.011395
3 apparent_temperature_mean 0.009613
12 pressure_msl_max 0.007716
10 surface_pressure_max 0.007313
Selected Features by RFE:
['wind_speed_10m_max', 'et0_fao_evapotranspiration', 'relative_humidity_2m_max', 'relative_humidity_2m_min', 'wet_bulb_temperature_2m_max']
============================================================
Analyzing Target: dew_point_2m_min
============================================================
Ranked Features based on Importance:
Feature Importance
17 wet_bulb_temperature_2m_min 0.939701
15 relative_humidity_2m_min 0.049060
1 temperature_2m_max 0.002069
8 rain_sum 0.001350
2 temperature_2m_min 0.001314
16 wet_bulb_temperature_2m_max 0.001238
19 soil_temperature_0_to_7cm_mean 0.000893
18 vapour_pressure_deficit_max 0.000656
14 relative_humidity_2m_max 0.000488
7 et0_fao_evapotranspiration 0.000439
0 temperature_2m_mean 0.000429
4 apparent_temperature_max 0.000416
9 dew_point_2m_max 0.000410
5 apparent_temperature_min 0.000294
6 wind_speed_10m_max 0.000275
3 apparent_temperature_mean 0.000275
10 surface_pressure_max 0.000199
11 surface_pressure_min 0.000191
12 pressure_msl_max 0.000153
13 pressure_msl_min 0.000150
Selected Features by RFE:
['temperature_2m_max', 'temperature_2m_min', 'relative_humidity_2m_min', 'wet_bulb_temperature_2m_max', 'wet_bulb_temperature_2m_min']
============================================================
Analyzing Target: dew_point_2m_max
============================================================
Ranked Features based on Importance:
Feature Importance
16 wet_bulb_temperature_2m_max 0.950799
14 relative_humidity_2m_max 0.030534
1 temperature_2m_max 0.005117
19 soil_temperature_0_to_7cm_mean 0.003065
18 vapour_pressure_deficit_max 0.002229
2 temperature_2m_min 0.001954
15 relative_humidity_2m_min 0.001047
17 wet_bulb_temperature_2m_min 0.000822
7 et0_fao_evapotranspiration 0.000605
0 temperature_2m_mean 0.000540
8 rain_sum 0.000458
5 apparent_temperature_min 0.000438
9 dew_point_2m_min 0.000422
4 apparent_temperature_max 0.000403
6 wind_speed_10m_max 0.000361
3 apparent_temperature_mean 0.000329
11 surface_pressure_min 0.000240
10 surface_pressure_max 0.000238
13 pressure_msl_min 0.000205
12 pressure_msl_max 0.000194
Selected Features by RFE:
['temperature_2m_max', 'relative_humidity_2m_max', 'wet_bulb_temperature_2m_max', 'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean']
============================================================
Analyzing Target: et0_fao_evapotranspiration
============================================================
Ranked Features based on Importance:
Feature Importance
18 vapour_pressure_deficit_max 0.374105
7 rain_sum 0.218456
4 apparent_temperature_max 0.063076
6 wind_speed_10m_max 0.049613
11 surface_pressure_min 0.047105
0 temperature_2m_mean 0.039581
19 soil_temperature_0_to_7cm_mean 0.034692
14 relative_humidity_2m_max 0.034592
15 relative_humidity_2m_min 0.019304
5 apparent_temperature_min 0.017646
2 temperature_2m_min 0.015994
3 apparent_temperature_mean 0.012797
1 temperature_2m_max 0.011430
8 dew_point_2m_max 0.010334
10 surface_pressure_max 0.010178
16 wet_bulb_temperature_2m_max 0.009248
17 wet_bulb_temperature_2m_min 0.008927
13 pressure_msl_min 0.008315
9 dew_point_2m_min 0.007757
12 pressure_msl_max 0.006849
Selected Features by RFE:
['temperature_2m_mean', 'apparent_temperature_max', 'rain_sum', 'surface_pressure_min', 'vapour_pressure_deficit_max']
============================================================
Analyzing Target: soil_temperature_0_to_7cm_mean
============================================================
Ranked Features based on Importance:
Feature Importance
1 temperature_2m_max 0.778519
19 vapour_pressure_deficit_max 0.070758
4 apparent_temperature_max 0.031376
7 et0_fao_evapotranspiration 0.021642
18 wet_bulb_temperature_2m_min 0.015745
0 temperature_2m_mean 0.010050
6 wind_speed_10m_max 0.009869
8 rain_sum 0.007562
2 temperature_2m_min 0.007029
9 dew_point_2m_max 0.006860
12 surface_pressure_min 0.005714
3 apparent_temperature_mean 0.005442
15 relative_humidity_2m_max 0.004697
5 apparent_temperature_min 0.004571
11 surface_pressure_max 0.004035
14 pressure_msl_min 0.003829
17 wet_bulb_temperature_2m_max 0.003684
10 dew_point_2m_min 0.003024
16 relative_humidity_2m_min 0.002981
13 pressure_msl_max 0.002614
Selected Features by RFE:
['temperature_2m_max', 'apparent_temperature_max', 'et0_fao_evapotranspiration', 'wet_bulb_temperature_2m_min', 'vapour_pressure_deficit_max']
============================================================
Analyzing Target: wet_bulb_temperature_2m_min
============================================================
Ranked Features based on Importance:
Feature Importance
10 dew_point_2m_min 0.807159
17 wet_bulb_temperature_2m_max 0.151958
0 temperature_2m_mean 0.016057
2 temperature_2m_min 0.013240
1 temperature_2m_max 0.001397
15 relative_humidity_2m_max 0.001220
9 dew_point_2m_max 0.001170
18 vapour_pressure_deficit_max 0.001149
5 apparent_temperature_min 0.001103
16 relative_humidity_2m_min 0.001088
19 soil_temperature_0_to_7cm_mean 0.000786
3 apparent_temperature_mean 0.000676
4 apparent_temperature_max 0.000601
7 et0_fao_evapotranspiration 0.000579
8 rain_sum 0.000539
6 wind_speed_10m_max 0.000415
12 surface_pressure_min 0.000251
11 surface_pressure_max 0.000228
14 pressure_msl_min 0.000194
13 pressure_msl_max 0.000191
Selected Features by RFE:
['temperature_2m_mean', 'temperature_2m_min', 'dew_point_2m_min', 'wet_bulb_temperature_2m_max', 'vapour_pressure_deficit_max']
============================================================
Analyzing Target: wet_bulb_temperature_2m_max
============================================================
Ranked Features based on Importance:
Feature Importance
9 dew_point_2m_max 0.954012
0 temperature_2m_mean 0.027019
1 temperature_2m_max 0.007559
10 dew_point_2m_min 0.001741
15 relative_humidity_2m_max 0.001664
19 soil_temperature_0_to_7cm_mean 0.001072
16 relative_humidity_2m_min 0.000934
17 wet_bulb_temperature_2m_min 0.000914
2 temperature_2m_min 0.000824
8 rain_sum 0.000634
4 apparent_temperature_max 0.000547
7 et0_fao_evapotranspiration 0.000530
3 apparent_temperature_mean 0.000475
18 vapour_pressure_deficit_max 0.000469
5 apparent_temperature_min 0.000425
6 wind_speed_10m_max 0.000353
11 surface_pressure_max 0.000236
12 surface_pressure_min 0.000225
14 pressure_msl_min 0.000185
13 pressure_msl_max 0.000180
Selected Features by RFE:
['temperature_2m_mean', 'temperature_2m_max', 'dew_point_2m_max', 'dew_point_2m_min', 'relative_humidity_2m_max']
AGAIN, a potential resolution for such multicollinearity is to observe the feature importance/rank w.r.t. the target in question. After observing the level of importance or rank, to then observe the correlation heat map. For a high positive correlation pair (where the idea of "high" is subjective), to choose the feature of higher importance.
Analyzing How Weather Patterns Have Changed Over Time for a Particular Month Across Multiple Years¶
This analysis involves examining historical weather data for a specific month across several years. By studying variables such as temperature, precipitation, relative humidity, wind patterns, etc. Researchers can identify trends and changes in climate over time. This information is crucial for understanding long-term climate variability, predicting future weather patterns, and assessing the impacts of climate change.
This code to effectively filter the meteorological data for a specific month, calculates the average maximum temperature for that month across different years, and visualizes the results in a line plot. Code can be adjusted to analyze other months as needed.
Analysis of a particular variable's trend within a particular month across multiple years. To visualize the average maximum of a variable in question for the selected month over time, providing valuable insights into climate patterns and potential changes. The month of July is chosen because such period typically records the highest temperatures in the northern hemisphere; January also chosen because this period typically records the lowest temperatures related to the Earth's tilt w.r.t. its axis.
Case for Max Temperature 2 meters above ground in July:
# Extract year and month from the 'date' column using .loc to avoid SettingWithCopyWarning
daily_dataframe_part = daily_dataframe.copy()
daily_dataframe_part.loc[:, 'year'] = daily_dataframe_part['date'].dt.year
daily_dataframe_part.loc[:, 'month'] = daily_dataframe_part['date'].dt.month
# Filter for a specific month (e.g., July = 7)
specific_month = 7 # Change this to the month you want to analyze
monthly_data = daily_dataframe_part[daily_dataframe_part['month'] == specific_month]
# Group by year and calculate the mean of a specific variable, e.g., 'temperature_2m_max'
monthly_mean = monthly_data.groupby('year')['temperature_2m_max'].mean().reset_index()
# Plotting the results
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_mean, x='year', y='temperature_2m_max', marker='o') # This line is from seaborn
plt.title(f'Average Max Temperature in Month {specific_month} Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Max Temperature (°C)')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
Case for Min Temperature 2 meters above ground in July:
# Ensure 'date' is a datetime object
daily_dataframe['date'] = pd.to_datetime(daily_dataframe['date'])
# Extract 'month' and 'year' from the 'date' column
daily_dataframe['month'] = daily_dataframe['date'].dt.month
daily_dataframe['year'] = daily_dataframe['date'].dt.year
# Now filter by month
specific_month = 7 # July
monthly_data = daily_dataframe[daily_dataframe['month'] == specific_month]
# Group by year and calculate the mean of temperature_2m_min
monthly_mean = monthly_data.groupby('year')['temperature_2m_min'].mean().reset_index()
# Plotting the results
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_mean, x='year', y='temperature_2m_min', marker='o') # This line is from seaborn
plt.title(f'Average Min Temperature in Month {specific_month} Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Min Temperature (°C)')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
Case for Precipitation in July:
# Filter for a specific month (e.g., July = 7)
specific_month = 7 # Change this to the month you want to analyze
daily_dataframe = daily_dataframe[daily_dataframe['month'] == specific_month]
# Group by year and calculate the mean of 'rain_sum'
monthly_mean = monthly_data.groupby('year')['rain_sum'].mean().reset_index()
# Plotting the results
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_mean, x='year', y='rain_sum', marker='o', color='blue')
plt.title(f'Average Rain Sum Per Day in Month {specific_month} Over the Years')
plt.xlabel('Year')
plt.ylabel('Average Rain Sum Per Day')
plt.xticks(monthly_mean['year'], rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()
Concerning the above scatter plots, for TMIN AND TMAX one should keep in mind that the average for each 1st month (January) of each year is considered, and not actual minimum of maximum. So, highly dynamic curves are not to be expected. A noticeable change in slope conveys that daily lower maximums or minimuns each year are increasing.
Statistical Method to Identify Significant Change in Climate¶
Statistical Significance
In statistics, well-known, the p-value (exhaustively applied) indicates whether there's statistical significance (whether difference or influence) in the segmentation of the variable(s) between the two periods. A p-value less than 0.05 suggests a significance (whether difference or influence).
Wilson Signed-Rank Test and Mann-Whitney U Test: A Comparative Overview¶
The Wilcoxon Signed-Rank Test and the Mann-Whitney U Test are two nonparametric statistical tests commonly used to compare the medians of two groups. These tests are particularly useful when the data does not meet the assumptions of parametric tests like the t-test, such as normality or homogeneity of variance.
Wilson Signed-Rank Test
The Wilcoxon Signed-Rank Test is used when the two groups being compared are paired or dependent (Hayes 2019). This means that each observation in one group corresponds to a specific observation in the other group. For example, it can be used to compare the pre- and post-treatment scores of the same individuals.
The test ranks the absolute differences between the paired observations and then sums the ranks of the differences that have the same sign. The resulting sum is compared to a critical value to determine if there is a significant difference between the medians of the two groups.
WSRT-HYPOTHESES --
A. Null Hypothesis: median difference between the paired observations is zero ($M_D = 0$).
$$H_0:M_D = 0$$B. Alternative Hypothesis: median difference between the paired observations is not zero ($M_D \neq\, 0$). Such can be non-directional (as is in median difference not being 0); Directional, as in median difference being positive (say, if group 1 is greater than group 2); Directional, as in in median difference being negative (say, if group 1 is less than group 2).
C. *Test Statistic: calculate the signed ranks of the differences between paired observations, then sum the ranks for the positive and negative differences to obtain the test statistic.
D. Decision Rule: compare the test statistic to critical values from the Wilcoxon signed-rank table or use a p-value to determine significance. Reject $H_0$ if the p-value is less than the chosen significance level ($\alpha$).
STEP 1: Calculate the Paired Differences --
Given two related samples, $X = \{x_1,x_2,...,x_n\}$ and $Y = \{y_1,y_2,...,y_n\}$
$$D_i = x_i - y_i$$where $i$ = 1,2,....n.
Ignore pairs where $D_i = 0$ (ties are removed).
STEP 2: Compute Absolute Differences and Ranks --
Compute the absolute differences:
$$|D_i|\,\,\text{for}\,\,D_i\,\neq\,0$$Rank the absolute didfferences in ascending order. Assign average ranks for tied values.
STEP 3: Assign Signs to Ranks --
Restore the sign of $D_i$ to its rank:
$$R_i = \begin{cases} +\text{Rank}(|D_i|) & \text{if } D_i > 0, \\ -\text{Rank}(|D_i|) & \text{if } D_i < 0. \end{cases}$$STEP 4: Compute the Taest Statistic --
Calculate the positive rank sum ($W^+$) and the negative rank sum ($W^-$):
$$RW^+ = \sum_{R_i > 0} R_i, \quad W^- = \sum_{R_i < 0} |R_i|$$The test statistic $W$ is the smaller of the two:
$$W = \text{min}(W^+, W^-)$$Mann-Whitney U Test
The Mann-Whitney U Test, also known as the Wilcoxon Rank-Sum Test, is used when the two groups being compared are independent (MacFaraland and Yates 2016). This means that there is no one-to-one correspondence between the observations in the two groups. For example, it can be used to compare the test scores of two different groups of students.
The test ranks all observations from both groups combined, then calculates the sum of the ranks for one of the groups. This sum is compared to a critical value to determine if there is a significant difference between the medians of the two groups.
MWUT-HYPOTHESES--
A. Null Hypothesis: the distributions of the two groups are equal.
$$H_0: F_X(t) = F_Y(t) \forall\,t$$where $F_X(t)$ and $F_Y(t)$ are the cumulative distribution functions of the two populations.
B. Alternative Hypothesis: the distributions of the two groups are not equal. This can be directional (one-tailed) or non-directional (two-tailed); the former, simply establishing non-equivalence, while the latter being the greater than or less than case with distribution contrast.
$$H_1: F_X(t) \neq\,\, F_Y(t) \text{ for some}t$$C. Test Statistic: Calculate the U statistic, which is based on the ranks of the combined data from both groups.
STEP 1: Combine and Rank the data --
Let $X = \{x_1,x_2,...,x_n\}$ and $Y = \{y_1,y_2,...,y_n\}$ represent the tow independent samples.
Combine the two samples into a single data set.
Rank all observations in ascending order, assigning averaged ranks to tied values.
STEP 2: Compute the Ranked Sums --
Compute the sum of ranks for each group:
$$R_X = \sum_{x \in X} \text{Rank}(x), \quad R_Y = \sum_{y \in Y} \text{Rank}(y)$$STEP 3: Compute the U Statistic --
Calculate the U Statistic for each group:
$$U_X = R_X - \frac{n_X (n_X + 1)}{2}, \quad U_Y = R_Y - \frac{n_Y (n_Y + 1)}{2}$$The two statistics are related:
$$U_x + U_Y = n_X\,n_Y$$The test statistic U is the smaller of the two:
$$U = \text{min}(U_X,U_Y)$$D. Compare the calculated U statistic to critical values from the Mann-Whitney U distribution table or use a p-value to determine significance. Reject the null hypothesis if the p-value is less than the chosen significance level ($\alpha$).
Considerations
The Wilcoxon Signed-Rank Test is typically used when you have paired observations, meaning that each data point in one period has a corresponding data point in the other period. For example, if you compare temperatures for the same month in two different years (e.g., January 1912 vs. January 1968), then the observations are paired.
If comparing aggregate measures (like average monthly temperatures) between two distinct periods without direct pairing of observations (i.e., you have one group of data for the earlier period and another group for the later period without matching), then the Wilcoxon Signed-Rank Test may not be appropriate. Instead, the Mann-Whitney U Test could be considered if you treat the two periods as independent groups. Both tests can be implemented with the appropriate specification, however, to implement the Mann-Whitney U Test, because the concern generally is weather data from different time periods.
When splitting the weather dataset into two periods (1912–1967 and 1968–2024) and compare them, the comparison is primarily focused on the attributes (e.g., temperature, precipitation, snowfall) within each period, rather than the years themselves. The reasoning:
- Attributes as the Basis of Comparison:
The aim of your analysis is to determine if there are statistically significant differences in the distribution, mean, median, or other characteristics of weather attributes between the two periods.
For instance, you may be interested in seeing if the average temperature or average precipitation levels have changed significantly from one period to the next.
The years themselves are just the framework for segmenting the data; the attributes (like temperature or precipitation) are the variables you are actually comparing.
- Years as Contextual Groupings:
By splitting the dataset based on years, you are essentially creating two groups or "batches" of data where the weather attributes are measured across different time frames.
The two time periods serve as the independent grouping factor, and you are testing whether the attributes (temperature, precipitation, etc.) exhibit different patterns between these two periods.
- Temporal Influence and Aggregation:
Weather data is inherently time-dependent, and by aggregating the data within each period, you are accounting for the overall trend or changes that might have happened over those years.
The comparison reflects how the overall weather patterns or averages of each attribute differ between these long-term periods, rather than focusing on year-to-year variations.
from scipy.stats import mannwhitneyu
ddy = daily_dataframe.copy()
period1 = ddy.loc['1980':'2002']
period2 = ddy.loc['2003':'2025']
Histogram of Period 1 Attributes¶
# Get the column names
column_names = period1.columns
print(column_names)
column_names_list = column_names.tolist()
# Calculating the number of ros and columns for subplots.
num_cols = 3 # 3 columns
num_rows = (len(column_names_list) + num_cols - 1) // num_cols
# Calculating the number of rows
# Creating subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize = (15, 10))
# Flatten if required.
if num_rows > 1:
axes = axes.flatten()
# Plot the histograms
for i, col in enumerate(column_names_list):
sns.histplot(data = period1[col], ax = axes[i], kde = True)
axes[i].set_title(f'Histogram of Period 1 {col}')
axes[i].set_xlabel('Value')
axes[i].set_ylabel('Frequency')
axes[i].grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
Index(['date', 'temperature_2m_mean', 'temperature_2m_max',
'temperature_2m_min', 'apparent_temperature_mean',
'apparent_temperature_max', 'apparent_temperature_min',
'wind_speed_10m_max', 'et0_fao_evapotranspiration', 'rain_sum',
'dew_point_2m_max', 'dew_point_2m_min', 'surface_pressure_max',
'surface_pressure_min', 'pressure_msl_max', 'pressure_msl_min',
'relative_humidity_2m_max', 'relative_humidity_2m_min',
'wet_bulb_temperature_2m_max', 'wet_bulb_temperature_2m_min',
'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean',
'month', 'year'],
dtype='object')
Histogram of Period 2 Attributes¶
# Get the column names
column_names = period2.columns
print(column_names)
column_names_list = column_names.tolist()
# Calculating the number of ros and columns for subplots.
num_cols = 3 # 3 columns
num_rows = (len(column_names_list) + num_cols - 1) // num_cols
# Calculating the number of rows
# Creating subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize = (15, 10))
# Flatten if required.
if num_rows > 1:
axes = axes.flatten()
# Plot the histograms
for i, col in enumerate(column_names_list):
sns.histplot(data = period2[col], ax = axes[i], kde = True)
axes[i].set_title(f'Histogram of Period 2 {col}')
axes[i].set_xlabel('Value')
axes[i].set_ylabel('Frequency')
axes[i].grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
Index(['date', 'temperature_2m_mean', 'temperature_2m_max',
'temperature_2m_min', 'apparent_temperature_mean',
'apparent_temperature_max', 'apparent_temperature_min',
'wind_speed_10m_max', 'et0_fao_evapotranspiration', 'rain_sum',
'dew_point_2m_max', 'dew_point_2m_min', 'surface_pressure_max',
'surface_pressure_min', 'pressure_msl_max', 'pressure_msl_min',
'relative_humidity_2m_max', 'relative_humidity_2m_min',
'wet_bulb_temperature_2m_max', 'wet_bulb_temperature_2m_min',
'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean',
'month', 'year'],
dtype='object')
Mann-Whitney Test for Differences:
# List of columns to compare
import numpy as np
columns_to_compare = ['temperature_2m_max', 'temperature_2m_min',
'wind_speed_10m_max', 'rain_sum',
'relative_humidity_2m_max', 'wet_bulb_temperature_2m_max',
'soil_temperature_0_to_7cm_mean']
# Loop through each column and perform the Mann-Whitney U Test
for column in columns_to_compare:
stat, p_value = mannwhitneyu(period1[column], period2[column])
Mean_Period_1 = np.mean(period1[column])
Mean_Period_2 = np.mean(period2[column])
print(f'Column: {column}')
print(f'Mann-Whitney U Test Statistic: {stat}')
print(f'p-value: {p_value}')
print(f'Mean_Period_1: {Mean_Period_1}')
print(f'Mean_Period_2: {Mean_Period_2}\n')
# Check significance
if p_value < 0.05:
print(f"There is a significant difference in {column} between the two periods.")
# Compare medians to determine which period is elevated
if Mean_Period_1 < Mean_Period_2:
print(f"{column} is elevated in Period 2.")
else:
print(f"{column} is elevated in Period 1.")
else:
print(f"There is no significant difference in {column} between the two periods.\n")
print(f'Next column to be evaluated:\n')
Column: temperature_2m_max Mann-Whitney U Test Statistic: 186.5 p-value: 0.6135340544650532 Mean_Period_1: 25.39150047302246 Mean_Period_2: 25.283384323120117 There is no significant difference in temperature_2m_max between the two periods. Next column to be evaluated: Column: temperature_2m_min Mann-Whitney U Test Statistic: 168.5 p-value: 0.8253449727843322 Mean_Period_1: 23.866498947143555 Mean_Period_2: 23.766822814941406 There is no significant difference in temperature_2m_min between the two periods. Next column to be evaluated: Column: wind_speed_10m_max Mann-Whitney U Test Statistic: 34.0 p-value: 0.059767510390904895 Mean_Period_1: 30.130117416381836 Mean_Period_2: 35.02366256713867 There is no significant difference in wind_speed_10m_max between the two periods. Next column to be evaluated: Column: rain_sum Mann-Whitney U Test Statistic: 114.0 p-value: 0.5331918559410417 Mean_Period_1: 0.6499999761581421 Mean_Period_2: 2.3805196285247803 There is no significant difference in rain_sum between the two periods. Next column to be evaluated: Column: relative_humidity_2m_max Mann-Whitney U Test Statistic: 129.0 p-value: 0.699529996606967 Mean_Period_1: 87.6551284790039 Mean_Period_2: 87.71660614013672 There is no significant difference in relative_humidity_2m_max between the two periods. Next column to be evaluated: Column: wet_bulb_temperature_2m_max Mann-Whitney U Test Statistic: 153.0 p-value: 0.9937154462383508 Mean_Period_1: 22.607219696044922 Mean_Period_2: 22.634029388427734 There is no significant difference in wet_bulb_temperature_2m_max between the two periods. Next column to be evaluated: Column: soil_temperature_0_to_7cm_mean Mann-Whitney U Test Statistic: 198.0 p-value: 0.49314501524676035 Mean_Period_1: 25.866500854492188 Mean_Period_2: 25.726337432861328 There is no significant difference in soil_temperature_0_to_7cm_mean between the two periods. Next column to be evaluated:
Permutation Tests¶
Permutation tests are a non-parametric method suitable for comparing two groups without assuming normality.
Null Hypothesis: there is no significant difference in the mean values of the specified attribute (e.g., PRCP, SNOW, etc.) between the two periods.
Alternative Hypothesis: there is a significant difference in the mean values of the specified attribute between the two periods.
Test Statistic: the difference in the means of the attribute values between the two periods
Let Let $X = \{x_1,x_2,...,x_n\}$ and $Y = \{y_1,y_2,...,y_n\}$ represent the tow independent samples.
Define a test statistic $T(X,Y)$ which measures the difference between the two groups. Will consider the mean:
$$T(X,Y) = \bar{X}-\bar{Y}$$where $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$, respectively,
Permutation Distribution
STEP 1: Combine the Data --
$$Z = X\,\cup\,Y$$STEP 2: Generate Permutations --
Randomly shuffle the combined dataset $Z$ to generate all possible permutations (or a large subset of permutations in practice due to computational constraints). Express a single permutation as $Z^*$, which is then split into two groups:
$$Z^* = {X^*,Y^*}$$where $X^*$ and $Y^*$ have the same sample sizes as the original $X$ and $Y$.
STEP 3: Compute the Test Statistic for each Permutation --
For each permutation $Z^*$, compute the test statistic:
$$T^* = T(X^*,Y^*)$$STEP 4: Construct the Permutation Distribution --
The set of test statistics across all permutations forms the permutation distribution:
$$\{T_1^*, T_2^*, \dots, T_k^*\}$$where $k$ is the total number of permutations.
P-value Calculation
The p-value to serve as the proportion of premutation test statistics that are extreme or more extreme than the observed test statistic, say, $T_{obs} = T(X,Y)$:
$$p = \frac{\#\{T^* \geq T_{\text{obs}}\}}{\text{Total permutations}}$$For a two-tailed test:
$$p = \frac{\#\{|T^*| \geq |T_{\text{obs}}|\}}{\text{Total permutations}}$$Significance Level: the standard 0.05 for alpha.
Decision Rule: if p-value is less than 0.05, then reject the null hypothesis; thus identifying significant difference in the respective attribute between the two periods. Else, no significant difference between the two periods.
Such to be implemented, serving as a "second opinion" model. The implementation:
from scipy.stats import permutation_test
# Sample data
# period1_data and period2_data should be the data for each period (e.g., temperature values)
def test_permutation(column):
# Define the test statistic as the difference in means
test_statistic = lambda x, y: x.mean() - y.mean()
# Perform the permutation test
result = permutation_test((period1[column], period2[column]),
test_statistic,
alternative='two-sided',
n_resamples=10000,
random_state=42)
# Print the p-value
p_value = result.pvalue
print(f"{column} - p-value: {p_value:.4f}")
# Check if the p-value indicates a significant difference
if p_value < 0.05:
print(f"Significant difference detected in {column} between the two periods.")
else:
print(f"No significant difference detected in {column} between the two periods.")
# Columns to test
columns = ['temperature_2m_max', 'temperature_2m_min',
'wind_speed_10m_max', 'rain_sum',
'relative_humidity_2m_max', 'wet_bulb_temperature_2m_max',
'soil_temperature_0_to_7cm_mean']
# Loop through each column and run the tes
for col in columns:
test_permutation(col)
temperature_2m_max - p-value: 0.7633 No significant difference detected in temperature_2m_max between the two periods. temperature_2m_min - p-value: 0.8561 No significant difference detected in temperature_2m_min between the two periods. wind_speed_10m_max - p-value: 0.0774 No significant difference detected in wind_speed_10m_max between the two periods. rain_sum - p-value: 0.4678 No significant difference detected in rain_sum between the two periods. relative_humidity_2m_max - p-value: 0.9007 No significant difference detected in relative_humidity_2m_max between the two periods. wet_bulb_temperature_2m_max - p-value: 0.9077 No significant difference detected in wet_bulb_temperature_2m_max between the two periods. soil_temperature_0_to_7cm_mean - p-value: 0.5735 No significant difference detected in soil_temperature_0_to_7cm_mean between the two periods.
Extreme Value Analysis¶
To now resort to reapplying the data ranging from year 1869 to year 2022; however due to being cleaned it's shortened. Being a historical daily meteorological data set for Central Park, New York. Such data is a Kaggle data set, namely, "New York City Weather: A 154 year Retrospective". Such daily data was not initially applied to time series analysis because too much instances are missing towards performing decent time series analysis (including cointegration analysis). However, such data set can be adequate for Extreme Value Analysis (EVA).
Extreme value analysis (EVA) is essential in climate science to study rare and extreme climate events, such as heatwaves, cold spells, floods, droughts, or storms. These events have significant impacts, and EVA provides a foundation to quantify their frequency, magnitude, and associated risks.
Steps for Extreme Value Analysis in Climate Data:
- Data Preparation --
Choose the Variable of Interest: Common climate variables include temperature, precipitation, wind speed, sea level, etc.
Filter the Data for Extremes: Focus on the most relevant extremes. For example: High extremes: Heatwaves (extremely high temperatures over a decent period).
Low extremes: Cold spells (extremely low temperatures).
Seasonal Adjustment: Climate data often has strong seasonal trends.
De-seasonalize the data by removing the seasonal component if necessary.
- Select an Extreme Value Model
The GENERALIZED EXTREME VALUE (GEV) distribution is a key concept in extreme value theory (EVT), which deals with the statistical modeling of extreme deviations from the median of probability distributions. The GEV distribution is used to model the behavior of the maximum (or minimum) of a large number of random variables, and it arises naturally when considering the limiting distribution of block maxima.
The GEV distribution combines three different types of distributions that arise in extreme value theory:
Gumbel distribution (Type I) – Models light-tailed extremes (e.g., normal or exponential distributions).
Fréchet distribution (Type II) – Models heavy-tailed extremes (e.g., power-law behavior like Pareto distributions).
Weibull distribution (Type III) – Models bounded upper extremes (e.g., distributions that have an upper limit).
These three distributions are unified under the GEV through a shape parameter, $\xi$, which determines the type:
$\xi$ = 0 (Gumbel)
$\xi$ > 0 (Fréchet)
$\xi$ < 0 (Weibull)
The GEV distribution is parameterized by three values:
$\mu$ (location parameter): Determines where the distribution is centered.
$\sigma$ (scale parameter): Controls the spread or scale of the distribution.
$\xi$ (shape parameter): Determines the tail behavior, distinguishing between Gumbel, Fréchet, and Weibull types.
The cumulative distribution function (CDF) of the GEV is given by:
$$F(x; \mu, \sigma, \xi) = \exp \left\{ - \left[ 1 + \xi \left( \frac{x - \mu}{\sigma} \right) \right]^{- \frac{1}{\xi}} \right\}, \quad \text{for} \quad 1 + \xi \left( \frac{x - \mu}{\sigma} \right) > 0$$Fits the Generalized Extreme Value (GEV) distribution to the block maxima. GEV combines three families of extreme value distributions: Gumbel, Frechet, and Weibull.
From the National Aeronautics and Space Administration (NASA) the resulting probability distribution function (PDF) for two category of shape parameter (i.e., whether it is equal to zero or not) is
$$\frac{1}{\sigma}t(x)^{x+1}e^{-t(x)}$$where
$$t(x) = \begin{cases} (1 + \xi \frac{x-\mu}{\sigma})^{-1/\xi}, & \text{if } \xi \neq 0 \\ e^{-(x-\mu)/\sigma}, & \text{if } \xi = 0 \end{cases}$$Python code to plot the three types of GEV densities:
from scipy.stats import genextreme as gev
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Set Seaborn style for pretty plots
sns.set(style="whitegrid")
# Define parameters for the three types of GEV distributions
params = {
"Gumbel (Type I)": {'shape': 0, 'loc': 0, 'scale': 1},
"Frechet (Type II)": {'shape': 0.5, 'loc': 0, 'scale': 1},
"Weibull (Type III)": {'shape': -0.5, 'loc': 0, 'scale': 1}
}
# Create a figure and axis
fig, ax = plt.subplots(figsize=(10, 6))
# Plot GEV densities for each type
x = np.linspace(-5, 5, 1000) # Common range for all GEV types
colors = ['coral', 'skyblue', 'limegreen']
for i, (label, param) in enumerate(params.items()):
# Extract shape, loc, and scale
shape = param['shape']
loc = param['loc']
scale = param['scale']
# Generate GEV PDF
pdf = gev.pdf(x, shape, loc=loc, scale=scale)
# Plot the PDF
ax.plot(x, pdf, label=f'{label}', color=colors[i], lw=2)
# Add labels and title
ax.set_title("GEV Distributions (Type I: Gumbel, Type II: Frechet, Type III: Weibull)", fontsize=16)
ax.set_xlabel("Value", fontsize=12)
ax.set_ylabel("Density", fontsize=12)
# Add legend
ax.legend(loc='upper right')
# Show the plot
plt.show()
There are two main approaches to extreme value modeling:
Block Maxima Approach --
Break your data into blocks (e.g., yearly or monthly) and only take the maximum (or minimum) value from each block.
Peak Over Threshold (POT) Approach --
Define a threshold (high quantile) and analyze values exceeding this threshold. Fits the Generalized Pareto Distribution (GPD) to exceedances over the threshold. The choice of threshold is critical; it should be high enough to capture extremes without being too high to leave out too many data points.The probability density function (PDF) of the Generalized Pareto Distribution (GPD) is given by:
$$f(x) = \begin{cases} \frac{1}{\sigma}\left(1 + \frac{\xi(x-\mu)}{\sigma}\right)^{-(1+\xi)^{-1}}, & \text{if } \xi \neq 0 \\ \frac{1}{\sigma}e^{-(x-\mu)/\sigma}, & \text{if } \xi = 0 \end{cases}$$The Generalized Pareto Distribution (GPD) is a flexible distribution often used to model extreme value phenomena. Its probability density function (PDF) is characterized by three parameters: $\mu$, $\sigma$, and $\xi$.
from scipy.stats import genpareto
# Parameters for the Generalized Pareto Distribution
shape_param = 0.5 # ξ (shape parameter)
scale_param = 1.0 # σ (scale parameter)
loc_param = 0.0 # μ (location parameter)
# Generate x values
x = np.linspace(0, 10, 1000)
# PDF and CDF using scipy's genpareto
pdf_values = genpareto.pdf(x, c=shape_param, loc=loc_param, scale=scale_param)
cdf_values = genpareto.cdf(x, c=shape_param, loc=loc_param, scale=scale_param)
# Plotting the PDF
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(x, pdf_values, label='PDF')
plt.title('Generalized Pareto Distribution (PDF)')
plt.xlabel('x')
plt.ylabel('Density')
plt.grid(True)
plt.legend()
# Plotting the CDF
plt.subplot(1, 2, 2)
plt.plot(x, cdf_values, label='CDF', color='orange')
plt.title('Generalized Pareto Distribution (CDF)')
plt.xlabel('x')
plt.ylabel('Cumulative Probability')
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
- The functions of the parameters both in GEV and GPD have the same function. Fit the Distribution GEV (Generalized Extreme Value) Distribution: In the block maxima method, you can fit the GEV distribution using Maximum Likelihood Estimation (MLE). The GEV distribution has three parameters –
Location ($\mu$): Determines the center of the distribution.
Scale ($\sigma$): Determines the spread.
Shape ($\xi$): Governs the tail behavior (whether it is heavy, light, or bounded).
GPD (Generalized Pareto Distribution). In the POT approach, fit the GPD to the excesses above the chosen threshold. The GPD distribution also has shape, scale, and threshold parameters.
- The return level is the value that is expected to be exceeded once on average every T years. This is crucial for risk assessment:
Return Level: For the GEV distribution, the $T$-year return level $z_T$ is computed by --
$$z_T = \mu + \frac{\xi}{\sigma} \left[ \left( -\ln\left(1 - \frac{1}{T}\right) \right)^{-\xi} - 1 \right]$$For the GPD distribution, return levels are computed based on the scale and shape parameters for exceedances.
- Diagnostics and Model Checking After fitting the model, it’s important to check the fit:
Quantile-Quantile (Q-Q) Plots -- Check whether the fitted distribution matches the observed extremes.
Return Level Plot -- Plot return levels against the return periods. This helps validate that your model accurately predicts extreme events for longer return periods.
Residual Analysis -- Analyze residuals to see if they show any patterns (residuals should be randomly distributed).
- Interpreting the Results
Return Period -- The expected number of years between extreme events of a certain magnitude. For example, the "100-year event" refers to an event that has a 1% chance of occurring in any given year.
Probability of Exceedance -- The likelihood that an extreme event will exceed a given threshold in a particular year.
Considerations in Climate Data EVA --
Stationarity: Many climate datasets are not stationary due to long-term trends (e.g., rising temperatures due to global warming). It may be necessary to de-trend the data before performing EVA.
Seasonality: Climate data is highly seasonal. You may need to separate the extremes by season or adjust for seasonality.
Dependence: Climate events may be temporally dependent. If extremes are clustered (e.g., heatwaves during summer), you may need to adjust for this dependence.
Block-Maxima Approach:
For the following, there's computing the temperature level that is expected to occur, on avaregage, once evey 100 years. The return period $T$ represents the average number of years you would expected between events that exceed a particular extreme value.
The return level, return_level(T) function computes the quantile associated with $1 - \frac{1}{T}$ for A GEV distribution:
$$\text{return level} = \text{GEV}.ppf(1-\frac{1}{T}, \text{shape, location, scale})$$When $T = 100$, this quantile corresponds to $1 - \frac{1}{100} = 0.99$, conveying that there's a 99% probability that this extreme this extreeme value will not be reached in any given year; equivalently, a 1% probability it will. By setting $T = 100$ , there's observation of the $99$-percentile of extreme values (whether daily, monthly or yearly, depending on the applied block size). This percentile imples a 1% probability of occurence in any given year, being consistent with a 100-year event threshold.
Reloading the data to avoid "date" column issues:
import openmeteo_requests
import pandas as pd
import requests_cache
from retry_requests import retry
# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)
# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
"latitude": 16.7425,
"longitude": -62.1874,
"start_date": "1980-01-08",
"end_date": "2025-06-24",
"daily": ["temperature_2m_mean", "temperature_2m_max", "temperature_2m_min", "apparent_temperature_mean", "apparent_temperature_max", "apparent_temperature_min", "wind_speed_10m_max", "et0_fao_evapotranspiration", "rain_sum", "dew_point_2m_max", "dew_point_2m_min", "surface_pressure_max", "surface_pressure_min", "pressure_msl_max", "pressure_msl_min", "relative_humidity_2m_max", "relative_humidity_2m_min", "wet_bulb_temperature_2m_max", "wet_bulb_temperature_2m_min", "vapour_pressure_deficit_max", "soil_temperature_0_to_7cm_mean"],
"timezone": "auto"
}
responses = openmeteo.weather_api(url, params=params)
# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()}{response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")
# Process daily data. The order of variables needs to be the same as requested.
daily = response.Daily()
daily_temperature_2m_mean = daily.Variables(0).ValuesAsNumpy()
daily_temperature_2m_max = daily.Variables(1).ValuesAsNumpy()
daily_temperature_2m_min = daily.Variables(2).ValuesAsNumpy()
daily_apparent_temperature_mean = daily.Variables(3).ValuesAsNumpy()
daily_apparent_temperature_max = daily.Variables(4).ValuesAsNumpy()
daily_apparent_temperature_min = daily.Variables(5).ValuesAsNumpy()
daily_wind_speed_10m_max = daily.Variables(6).ValuesAsNumpy()
daily_et0_fao_evapotranspiration = daily.Variables(7).ValuesAsNumpy()
daily_rain_sum = daily.Variables(8).ValuesAsNumpy()
daily_dew_point_2m_max = daily.Variables(9).ValuesAsNumpy()
daily_dew_point_2m_min = daily.Variables(10).ValuesAsNumpy()
daily_surface_pressure_max = daily.Variables(11).ValuesAsNumpy()
daily_surface_pressure_min = daily.Variables(12).ValuesAsNumpy()
daily_pressure_msl_max = daily.Variables(13).ValuesAsNumpy()
daily_pressure_msl_min = daily.Variables(14).ValuesAsNumpy()
daily_relative_humidity_2m_max = daily.Variables(15).ValuesAsNumpy()
daily_relative_humidity_2m_min = daily.Variables(16).ValuesAsNumpy()
daily_wet_bulb_temperature_2m_max = daily.Variables(17).ValuesAsNumpy()
daily_wet_bulb_temperature_2m_min = daily.Variables(18).ValuesAsNumpy()
daily_vapour_pressure_deficit_max = daily.Variables(19).ValuesAsNumpy()
daily_soil_temperature_0_to_7cm_mean = daily.Variables(20).ValuesAsNumpy()
daily_data = {"date": pd.date_range(
start = pd.to_datetime(daily.Time(), unit = "s", utc = True),
end = pd.to_datetime(daily.TimeEnd(), unit = "s", utc = True),
freq = pd.Timedelta(seconds = daily.Interval()),
inclusive = "left"
)}
daily_data["temperature_2m_mean"] = daily_temperature_2m_mean
daily_data["temperature_2m_max"] = daily_temperature_2m_max
daily_data["temperature_2m_min"] = daily_temperature_2m_min
daily_data["apparent_temperature_mean"] = daily_apparent_temperature_mean
daily_data["apparent_temperature_max"] = daily_apparent_temperature_max
daily_data["apparent_temperature_min"] = daily_apparent_temperature_min
daily_data["wind_speed_10m_max"] = daily_wind_speed_10m_max
daily_data["et0_fao_evapotranspiration"] = daily_et0_fao_evapotranspiration
daily_data["rain_sum"] = daily_rain_sum
daily_data["dew_point_2m_max"] = daily_dew_point_2m_max
daily_data["dew_point_2m_min"] = daily_dew_point_2m_min
daily_data["surface_pressure_max"] = daily_surface_pressure_max
daily_data["surface_pressure_min"] = daily_surface_pressure_min
daily_data["pressure_msl_max"] = daily_pressure_msl_max
daily_data["pressure_msl_min"] = daily_pressure_msl_min
daily_data["relative_humidity_2m_max"] = daily_relative_humidity_2m_max
daily_data["relative_humidity_2m_min"] = daily_relative_humidity_2m_min
daily_data["wet_bulb_temperature_2m_max"] = daily_wet_bulb_temperature_2m_max
daily_data["wet_bulb_temperature_2m_min"] = daily_wet_bulb_temperature_2m_min
daily_data["vapour_pressure_deficit_max"] = daily_vapour_pressure_deficit_max
daily_data["soil_temperature_0_to_7cm_mean"] = daily_soil_temperature_0_to_7cm_mean
daily_dataframe = pd.DataFrame(data = daily_data)
print(daily_dataframe)
daily_dataframe = daily_dataframe.copy()
goody_frame = daily_dataframe.dropna()
goody_frame.info()
Coordinates 16.76625633239746°N -62.20843505859375°E
Elevation 309.0 m asl
Timezone b'America/Montserrat'b'GMT-4'
Timezone difference to GMT+0 -14400 s
date temperature_2m_mean temperature_2m_max \
0 1980-01-08 04:00:00+00:00 23.374834 24.141499
1 1980-01-09 04:00:00+00:00 23.264421 23.891499
2 1980-01-10 04:00:00+00:00 22.322748 23.191502
3 1980-01-11 04:00:00+00:00 22.587332 23.341499
4 1980-01-12 04:00:00+00:00 21.306086 22.091499
... ... ... ...
16600 2025-06-20 04:00:00+00:00 25.351082 26.199001
16601 2025-06-21 04:00:00+00:00 25.390665 25.898998
16602 2025-06-22 04:00:00+00:00 25.317749 25.898998
16603 2025-06-23 04:00:00+00:00 NaN 25.848999
16604 2025-06-24 04:00:00+00:00 NaN NaN
temperature_2m_min apparent_temperature_mean \
0 22.191502 22.092840
1 22.191502 22.358231
2 21.341499 21.067259
3 21.841499 19.905577
4 20.541500 19.145449
... ... ...
16600 24.848999 25.104864
16601 24.699001 25.419016
16602 24.449001 24.848602
16603 25.098999 NaN
16604 NaN NaN
apparent_temperature_max apparent_temperature_min wind_speed_10m_max \
0 23.520189 20.983297 37.212578
1 23.697132 21.602598 36.896046
2 22.371422 19.988932 35.654541
3 20.436180 18.984425 42.072281
4 19.637054 18.262983 40.104061
... ... ... ...
16600 27.231419 23.766788 40.882591
16601 27.573139 24.278919 38.166790
16602 26.219694 23.004978 44.039349
16603 25.357843 23.626095 42.990990
16604 NaN NaN NaN
et0_fao_evapotranspiration rain_sum ... surface_pressure_max \
0 3.982460 1.5 ... 983.794922
1 3.946293 0.8 ... 984.397400
2 3.259691 2.7 ... 983.913513
3 4.604709 0.5 ... 983.572449
4 2.766571 5.7 ... 982.082092
... ... ... ... ...
16600 4.981394 0.1 ... 983.506775
16601 5.119689 0.0 ... 983.344971
16602 5.130907 1.0 ... 982.319397
16603 NaN NaN ... 981.898865
16604 NaN NaN ... NaN
surface_pressure_min pressure_msl_max pressure_msl_min \
0 980.577454 1019.299988 1016.099976
1 981.443359 1019.900024 1016.900024
2 980.805786 1019.599976 1016.299988
3 980.355164 1019.099976 1015.900024
4 978.976501 1017.799988 1014.599976
... ... ... ...
16600 981.255981 1018.700012 1016.500000
16601 980.240479 1018.700012 1015.400024
16602 979.411743 1017.500000 1014.500000
16603 979.643860 1017.099976 1014.799988
16604 NaN NaN NaN
relative_humidity_2m_max relative_humidity_2m_min \
0 87.652779 70.725937
1 87.906815 73.156029
2 90.619431 71.578697
3 81.800613 61.149487
4 89.427284 78.321884
... ... ...
16600 86.541199 70.866669
16601 85.219734 72.591751
16602 86.767601 72.591751
16603 84.229759 75.320984
16604 NaN NaN
wet_bulb_temperature_2m_max wet_bulb_temperature_2m_min \
0 21.027277 20.169138
1 20.914402 20.337797
2 20.636232 18.998484
3 19.724335 17.843048
4 19.959215 19.202456
... ... ...
16600 23.118631 21.683819
16601 22.751518 22.099451
16602 22.906918 21.904879
16603 23.149427 22.411777
16604 NaN NaN
vapour_pressure_deficit_max soil_temperature_0_to_7cm_mean
0 0.880710 24.816500
1 0.795568 24.729010
2 0.783625 24.678999
3 1.107534 24.629000
4 0.576288 24.578997
... ... ...
16600 0.984500 26.217749
16601 0.912614 26.238586
16602 0.912614 26.267754
16603 0.821694 NaN
16604 NaN NaN
[16605 rows x 22 columns]
<class 'pandas.core.frame.DataFrame'>
Index: 16603 entries, 0 to 16602
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 16603 non-null datetime64[ns, UTC]
1 temperature_2m_mean 16603 non-null float32
2 temperature_2m_max 16603 non-null float32
3 temperature_2m_min 16603 non-null float32
4 apparent_temperature_mean 16603 non-null float32
5 apparent_temperature_max 16603 non-null float32
6 apparent_temperature_min 16603 non-null float32
7 wind_speed_10m_max 16603 non-null float32
8 et0_fao_evapotranspiration 16603 non-null float32
9 rain_sum 16603 non-null float32
10 dew_point_2m_max 16603 non-null float32
11 dew_point_2m_min 16603 non-null float32
12 surface_pressure_max 16603 non-null float32
13 surface_pressure_min 16603 non-null float32
14 pressure_msl_max 16603 non-null float32
15 pressure_msl_min 16603 non-null float32
16 relative_humidity_2m_max 16603 non-null float32
17 relative_humidity_2m_min 16603 non-null float32
18 wet_bulb_temperature_2m_max 16603 non-null float32
19 wet_bulb_temperature_2m_min 16603 non-null float32
20 vapour_pressure_deficit_max 16603 non-null float32
21 soil_temperature_0_to_7cm_mean 16603 non-null float32
dtypes: datetime64[ns, UTC](1), float32(21)
memory usage: 1.6 MB
EVA computation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import genextreme, genpareto
# --- Prepare data ---
goody_frame['date'] = pd.to_datetime(goody_frame['date'])
goody_frame = goody_frame.set_index('date').sort_index()
# --- Constants ---
T = 100 # Return period in years
threshold_quantile = 0.95
years = (goody_frame.index.max() - goody_frame.index.min()).days / 365.25
# --- Loop over variables ---
for col in goody_frame.columns:
print(f"\n==========================")
print(f"📈 Analyzing variable: {col}")
print(f"==========================")
data = goody_frame[col].dropna()
# -----------------------------
# BLOCK MAXIMA + GEV
# -----------------------------
block_max = data.resample('YE').max().dropna()
c, loc_gev, scale_gev = genextreme.fit(block_max)
# Return Level for 100-year event (GEV)
if c != 0:
z_gev = loc_gev + (scale_gev / c) * ((-np.log(1 - 1/T))**(-c) - 1)
else:
z_gev = loc_gev - scale_gev * np.log(-np.log(1 - 1/T))
# Plot GEV
x_gev = np.linspace(block_max.min(), block_max.max(), 100)
pdf_gev = genextreme.pdf(x_gev, c, loc=loc_gev, scale=scale_gev)
plt.figure(figsize=(10, 4))
plt.hist(block_max, bins=10, density=True, alpha=0.5, label='Block Maxima')
plt.plot(x_gev, pdf_gev, 'r-', label='GEV Fit')
plt.axvline(z_gev, color='k', linestyle='--', label=f'100-yr RL = {z_gev:.2f}')
plt.title(f"{col} - GEV (Block Maxima)")
plt.xlabel(col)
plt.ylabel("Density")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
# -----------------------------
# POT + GPD
# -----------------------------
threshold = data.quantile(threshold_quantile)
exceedances = data[data > threshold] - threshold
exceedances = exceedances.dropna()
if exceedances.empty:
print("⚠️ No exceedances above threshold — skipping POT analysis.")
continue
shape_gpd, loc_gpd, scale_gpd = genpareto.fit(exceedances)
num_exceed = exceedances.shape[0]
n = num_exceed / years # exceedances per year
# Return Level for 100-year event (GPD)
if shape_gpd != 0:
z_gpd = threshold + (scale_gpd / shape_gpd) * ((T * n)**shape_gpd - 1)
else:
z_gpd = threshold + scale_gpd * np.log(T * n)
# Plot GPD
x_gpd = np.linspace(0, exceedances.max(), 100)
pdf_gpd = genpareto.pdf(x_gpd, shape_gpd, loc=loc_gpd, scale=scale_gpd)
plt.figure(figsize=(10, 4))
plt.hist(exceedances, bins=20, density=True, alpha=0.5, label='Exceedances')
plt.plot(x_gpd, pdf_gpd, 'r-', label='GPD Fit')
plt.axvline(z_gpd - threshold, color='k', linestyle='--', label=f'100-yr RL = {z_gpd:.2f}')
plt.title(f"{col} - GPD (Peaks Over Threshold)")
plt.xlabel(f"{col} exceedances over {threshold:.2f}")
plt.ylabel("Density")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
# -----------------------------
# Summary Printout
# -----------------------------
print("📊 GEV Fit:")
print(f" Shape: {c:.4f}")
print(f" Location: {loc_gev:.2f}")
print(f" Scale: {scale_gev:.2f}")
print(f" 🎯 100-year Return Level (GEV): {z_gev:.2f}")
print("\n📊 GPD Fit:")
print(f" Threshold: {threshold:.2f}")
print(f" Shape: {shape_gpd:.4f}")
print(f" Location: {loc_gpd:.2f}")
print(f" Scale: {scale_gpd:.2f}")
print(f" 🎯 100-year Return Level (GPD): {z_gpd:.2f}")
C:\Users\verlene\AppData\Local\Temp\ipykernel_10952\2125509235.py:7: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy goody_frame['date'] = pd.to_datetime(goody_frame['date'])
========================== 📈 Analyzing variable: temperature_2m_mean ==========================
📊 GEV Fit: Shape: 0.0523 Location: 26.01 Scale: 0.50 🎯 100-year Return Level (GEV): 28.62 📊 GPD Fit: Threshold: 26.04 Shape: -0.2914 Location: 0.00 Scale: 0.57 🎯 100-year Return Level (GPD): 27.77 ========================== 📈 Analyzing variable: temperature_2m_max ==========================
📊 GEV Fit: Shape: -0.3138 Location: 26.50 Scale: 0.58 🎯 100-year Return Level (GEV): 27.90 📊 GPD Fit: Threshold: 27.55 Shape: -0.3587 Location: 0.05 Scale: 0.80 🎯 100-year Return Level (GPD): 29.63 ========================== 📈 Analyzing variable: temperature_2m_min ==========================
📊 GEV Fit: Shape: 0.1078 Location: 25.53 Scale: 0.41 🎯 100-year Return Level (GEV): 27.99 📊 GPD Fit: Threshold: 25.10 Shape: 0.0217 Location: 0.04 Scale: 0.37 🎯 100-year Return Level (GPD): 28.15 ========================== 📈 Analyzing variable: apparent_temperature_mean ==========================
📊 GEV Fit: Shape: 0.0894 Location: 29.02 Scale: 0.78 🎯 100-year Return Level (GEV): 33.45 📊 GPD Fit: Threshold: 28.06 Shape: -0.1756 Location: 0.00 Scale: 0.90 🎯 100-year Return Level (GPD): 31.81 ========================== 📈 Analyzing variable: apparent_temperature_max ==========================
📊 GEV Fit: Shape: 0.2380 Location: 31.57 Scale: 1.16 🎯 100-year Return Level (GEV): 41.24 📊 GPD Fit: Threshold: 30.48 Shape: -0.2249 Location: 0.00 Scale: 1.16 🎯 100-year Return Level (GPD): 34.70 ========================== 📈 Analyzing variable: apparent_temperature_min ==========================
📊 GEV Fit: Shape: 0.0871 Location: 27.74 Scale: 0.76 🎯 100-year Return Level (GEV): 32.07 📊 GPD Fit: Threshold: 26.58 Shape: -0.1257 Location: 0.00 Scale: 0.82 🎯 100-year Return Level (GPD): 30.59 ========================== 📈 Analyzing variable: wind_speed_10m_max ==========================
📊 GEV Fit: Shape: -0.2789 Location: 49.74 Scale: 5.40 🎯 100-year Return Level (GEV): 63.74 📊 GPD Fit: Threshold: 40.51 Shape: 0.2245 Location: 0.00 Scale: 2.65 🎯 100-year Return Level (GPD): 92.39 ========================== 📈 Analyzing variable: et0_fao_evapotranspiration ==========================
📊 GEV Fit: Shape: -0.1624 Location: 5.93 Scale: 0.28 🎯 100-year Return Level (GEV): 6.82 📊 GPD Fit: Threshold: 5.61 Shape: -0.1881 Location: 0.00 Scale: 0.38 🎯 100-year Return Level (GPD): 7.13 ========================== 📈 Analyzing variable: rain_sum ==========================
📊 GEV Fit: Shape: -0.3210 Location: 34.74 Scale: 18.52 🎯 100-year Return Level (GEV): 79.26 📊 GPD Fit: Threshold: 8.40 Shape: 0.3871 Location: 0.00 Scale: 5.36 🎯 100-year Return Level (GPD): 247.29 ========================== 📈 Analyzing variable: dew_point_2m_max ==========================
📊 GEV Fit: Shape: 0.1250 Location: 23.06 Scale: 0.35 🎯 100-year Return Level (GEV): 25.20 📊 GPD Fit: Threshold: 22.79 Shape: -0.2295 Location: 0.01 Scale: 0.42 🎯 100-year Return Level (GPD): 24.28 ========================== 📈 Analyzing variable: dew_point_2m_min ==========================
📊 GEV Fit: Shape: 0.3433 Location: 22.17 Scale: 0.38 🎯 100-year Return Level (GEV): 26.48 📊 GPD Fit: Threshold: 21.79 Shape: -0.1907 Location: 0.01 Scale: 0.33 🎯 100-year Return Level (GPD): 23.12 ========================== 📈 Analyzing variable: surface_pressure_max ==========================
📊 GEV Fit: Shape: 0.3107 Location: 985.25 Scale: 0.70 🎯 100-year Return Level (GEV): 992.42 📊 GPD Fit: Threshold: 983.99 Shape: -0.1512 Location: 0.00 Scale: 0.68 🎯 100-year Return Level (GPD): 987.04 ========================== 📈 Analyzing variable: surface_pressure_min ==========================
📊 GEV Fit: Shape: 0.3398 Location: 982.21 Scale: 0.71 🎯 100-year Return Level (GEV): 990.08 📊 GPD Fit: Threshold: 981.03 Shape: -0.1593 Location: 0.00 Scale: 0.65 🎯 100-year Return Level (GPD): 983.90 ========================== 📈 Analyzing variable: pressure_msl_max ==========================
📊 GEV Fit: Shape: 0.3270 Location: 1020.82 Scale: 0.72 🎯 100-year Return Level (GEV): 1028.54 📊 GPD Fit: Threshold: 1019.50 Shape: -0.0520 Location: 0.10 Scale: 0.59 🎯 100-year Return Level (GPD): 1023.14 ========================== 📈 Analyzing variable: pressure_msl_min ==========================
📊 GEV Fit: Shape: 0.5000 Location: 1017.71 Scale: 0.80 🎯 100-year Return Level (GEV): 1032.10 📊 GPD Fit: Threshold: 1016.40 Shape: -0.1092 Location: 0.10 Scale: 0.61 🎯 100-year Return Level (GPD): 1019.48 ========================== 📈 Analyzing variable: relative_humidity_2m_max ==========================
📊 GEV Fit: Shape: 0.0249 Location: 92.27 Scale: 0.77 🎯 100-year Return Level (GEV): 96.00 📊 GPD Fit: Threshold: 90.96 Shape: -0.0197 Location: 0.00 Scale: 0.55 🎯 100-year Return Level (GPD): 94.78 ========================== 📈 Analyzing variable: relative_humidity_2m_min ==========================
📊 GEV Fit: Shape: 0.3537 Location: 83.46 Scale: 1.53 🎯 100-year Return Level (GEV): 101.15 📊 GPD Fit: Threshold: 80.83 Shape: -0.1529 Location: 0.00 Scale: 1.34 🎯 100-year Return Level (GPD): 86.82 ========================== 📈 Analyzing variable: wet_bulb_temperature_2m_max ==========================
📊 GEV Fit: Shape: 0.0392 Location: 23.68 Scale: 0.36 🎯 100-year Return Level (GEV): 25.50 📊 GPD Fit: Threshold: 23.57 Shape: -0.2716 Location: 0.00 Scale: 0.45 🎯 100-year Return Level (GPD): 25.01 ========================== 📈 Analyzing variable: wet_bulb_temperature_2m_min ==========================
📊 GEV Fit: Shape: 0.1621 Location: 23.05 Scale: 0.35 🎯 100-year Return Level (GEV): 25.44 📊 GPD Fit: Threshold: 22.78 Shape: -0.1442 Location: 0.00 Scale: 0.33 🎯 100-year Return Level (GPD): 24.29 ========================== 📈 Analyzing variable: vapour_pressure_deficit_max ==========================
📊 GEV Fit: Shape: -0.4890 Location: 1.32 Scale: 0.11 🎯 100-year Return Level (GEV): 1.53 📊 GPD Fit: Threshold: 1.42 Shape: -0.1940 Location: 0.00 Scale: 0.20 🎯 100-year Return Level (GPD): 2.20 ========================== 📈 Analyzing variable: soil_temperature_0_to_7cm_mean ==========================
📊 GEV Fit: Shape: -0.7067 Location: 26.99 Scale: 0.69 🎯 100-year Return Level (GEV): 27.93 📊 GPD Fit: Threshold: 29.88 Shape: -0.3008 Location: 0.00 Scale: 1.82 🎯 100-year Return Level (GPD): 35.30
Outlier Detection¶
Examining the Data:
Outliers will be treated as extreme events rather than cases to disregard; there's not much depth in extensive climate variability research if extreme events are disregarded in data.
Remarks on the Outlier Detector Algorithms¶
Outliers are not things just to be casted away just because they don't fit ideal models. Outliers in this development are treated as "extreme" weather events in relation to climate variability. If weather states don't fall outside of the "ordinary", then climate change studies would have no value.
Outlier Detection by Local Outlier Factor Method¶
The Local Outlier Factor (LOF) is an unsupervised learning algorithm designed to identify anomalous data points within a dataset. It operates by comparing the local density of a data point to the density of its neighbors. It provides a convenient and efficient way to detect outliers in various applications.LOF is classified as an unsupervised learning algorithm because it does not require labeled data to identify outliers. Unlike supervised learning, which involves training a model on labeled examples to make predictions, LOF relies solely on the inherent structure and distribution of the data itself. This makes it particularly useful in scenarios where labeled data is scarce or unavailable.
For multivariate data sets The LOF algorithm does not analyse each attribute independently when determining if a row is an outlier. Instead, it considers all attributes (features) collectively and assesses how the entire row (i.e., the combination of all feature values) deviates from its neighboring rows in the multi-dimensional feature space. It's characteristics:
Multi-Dimensional Analysis: LOF operates in a multi-dimensional space where each dimension corresponds to one attribute (e.g., temperature, relative humidity, pressure, etc.). It doesn't check individual attributes separately but looks at the combined values of all attributes for each row.
Density-Based Approach: The algorithm calculates the local density of each data point by comparing it to its nearest neighbors. It then determines whether a point is an outlier based on the relative density compared to the densities of its neighbors. If the density of a point is significantly lower than that of its neighbors, the point is considered an outlier.
Outlier Score for the Entire Row: LOF assigns an outlier score based on how the row's multi-dimensional position and density differ from those of nearby rows. It labels a row as an outlier only if its overall pattern (considering all attributes together) deviates significantly from its neighbors.
Mathematical Structure of LOF¶
For dataset $X = {x_1,x_2,...,x_n}$ in a $r$-dimensional space, let $k$ be the number of nearest neighbors considered for the outlier detection.
For each data point $x_i$, compute the distance to all other points. Such distance can be computed by applying any suitable metric (Eucluidean distance commonly used however):
K-Nearest Neigbors(k-NN): For each point $x_i$ find its $k$ nearest neigbors $N_k(x_i)$ based on the distance calculated.
Reachability Distance: The reachability distance $d_reach(x_i\,x_j)$ from point $x_i$ to point $x_j$ is defined as:
where $kNNDist(x_j)$ is the distance from $x_j$ to its $k$-th nearest neighbor, ensuring that the reachability distance accounts for the local density.
- Local Reachability Density (LRD): The LRD $\rho_{\text{L}}(x_i)$ for a point $x_i$ is computed as the inverse of the average reachability distance to its $k$ nearest neighbors:
- Local Outlier Factor: the LOF score, $LOF(x_i)$ for a point $x_i$ is defined as the ratio of the local reachability density of $x_i$ to the average local reachability density of its $k$ nearest neighbors:
If $LOF(x_i) < 1$: $x_i$ is considered a normal point;
If $LOF(x_i) > 1$: $x_i$ is considered an outlier, where higher values suggest stronger outlier
Visualizations for Local Outlier Factor:
- Scatter Plot with Outliers Highlighted:
Visualizes the dataset and highlights the outliers identified by LOF.
- Decision Boundary Visualization:
Shows the regions where LOF considers points as outliers or inliers.
The following is a demonstration of LOF.
from sklearn.neighbors import LocalOutlierFactor
# Generate synthetic data
np.random.seed(42)
X_inliers = 0.3 * np.random.randn(100, 2)
X_outliers = np.random.uniform(low=-4, high=4, size=(20, 2))
X = np.concatenate([X_inliers, X_outliers], axis=0)
# Fit the LOF model
lof = LocalOutlierFactor(n_neighbors=20)
y_pred = lof.fit_predict(X)
outlier_scores = -lof.negative_outlier_factor_
# 1. Scatter Plot with Outliers Highlighted
plt.figure(figsize=(10, 6))
plt.scatter(X[:, 0], X[:, 1], color='blue', s=20, label='Inliers')
plt.scatter(X[y_pred == -1, 0], X[y_pred == -1, 1], color='red', s=50, edgecolor='k', label='Outliers')
plt.title('Local Outlier Factor (LOF) - Outliers Detection')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
# 2. Decision Boundary Visualization
xx, yy = np.meshgrid(np.linspace(-5, 5, 100), np.linspace(-5, 5, 100))
grid_points = np.c_[xx.ravel(), yy.ravel()]
# Manually compute the nearest neighbors for grid points
distances = np.linalg.norm(grid_points[:, np.newaxis] - X, axis=2)
# Get indices of the nearest neighbors
neighbors = np.argsort(distances, axis=1)[:, :lof.n_neighbors]
# Compute outlier scores for grid points
outlier_scores_grid = np.array([
-np.mean(lof.negative_outlier_factor_[neighbors[i]]) for i in range(grid_points.shape[0])
])
# Reshape the scores for contour plotting
Z = outlier_scores_grid.reshape(xx.shape)
# Plotting the decision boundary
plt.figure(figsize=(10, 6))
plt.contourf(xx, yy, Z, levels=np.linspace(Z.min(), Z.max(), 7), cmap=plt.cm.Blues_r)
plt.colorbar()
plt.scatter(X[:, 0], X[:, 1], c='white', s=20, edgecolor='k')
plt.scatter(X[y_pred == -1, 0], X[y_pred == -1, 1], color='red', s=50, edgecolor='k', label='Outliers')
plt.title('LOF Decision Boundary and Outliers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
Explanation of Visuals:
- Scatter Plot with Outliers Highlighted:
This plot shows all data points, with blue points representing inliers and red points (larger markers) indicating the detected outliers based on LOF. The algorithm calculates the local density of each point and identifies those with significantly lower density compared to their neighbors as outliers.
- Decision Boundary Visualization:
The decision boundary plot shows the regions where LOF detects outliers. The background color gradient represents the level of LOF scores, with darker shades indicating areas with higher anomaly scores. Points labeled as outliers are plotted in red.
For data neighours a choice of 30-50 can help reduce the impact of seasonal intersections on outlier detection, allowing the model to differentiate between genuine outliers and typical seasonal variability.
The DataFrame generated will include a column 'LOF', which indicates outliers with -1, inliers with 1, and one can observe the outliers displayed separately beneath.
For data neighours a choice of 30-50 can help reduce the impact of seasonal intersections on outlier detection, allowing the model to differentiate between genuine outliers and typical seasonal variability.
Physical attributes in meteorology are generally coupled in weather dynamics, so outlier detection upon a sole attribute attribute may not have much meaning.
Binary Designation in Code¶
The LOF algorithm will be subjugated to a binary grouping (not representing actual LOF values, but grouping the range of LOF values into binary):
A LOF value of 1 means that the observation is considered an inlier (not an outlier). It indicates that the data point is in a "normal" range compared to its neighbors.
A LOF value of -1 indicates that the observation is considered an outlier. This means that this data point is significantly different from the rest of the data, based on its local density compared to its neighbors.
goody_frame.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 16603 entries, 1980-01-08 04:00:00+00:00 to 2025-06-22 04:00:00+00:00 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 temperature_2m_mean 16603 non-null float32 1 temperature_2m_max 16603 non-null float32 2 temperature_2m_min 16603 non-null float32 3 apparent_temperature_mean 16603 non-null float32 4 apparent_temperature_max 16603 non-null float32 5 apparent_temperature_min 16603 non-null float32 6 wind_speed_10m_max 16603 non-null float32 7 et0_fao_evapotranspiration 16603 non-null float32 8 rain_sum 16603 non-null float32 9 dew_point_2m_max 16603 non-null float32 10 dew_point_2m_min 16603 non-null float32 11 surface_pressure_max 16603 non-null float32 12 surface_pressure_min 16603 non-null float32 13 pressure_msl_max 16603 non-null float32 14 pressure_msl_min 16603 non-null float32 15 relative_humidity_2m_max 16603 non-null float32 16 relative_humidity_2m_min 16603 non-null float32 17 wet_bulb_temperature_2m_max 16603 non-null float32 18 wet_bulb_temperature_2m_min 16603 non-null float32 19 vapour_pressure_deficit_max 16603 non-null float32 20 soil_temperature_0_to_7cm_mean 16603 non-null float32 dtypes: float32(21) memory usage: 1.5 MB
from sklearn.neighbors import LocalOutlierFactor
# Select all numeric columns in the DataFrame
columns = goody_frame.select_dtypes(include='number').columns.tolist()
# Initialize the LOF model
lof = LocalOutlierFactor(n_neighbors=30)
# Fit the model and predict outliers
goody_frame['LOF'] = lof.fit_predict(goody_frame[columns])
# Extract the outliers
outliers = goody_frame[goody_frame['LOF'] == -1]
# Display the original DataFrame with LOF results
print("DataFrame with LOF results:")
print(goody_frame)
print("\nOutliers:")
print(outliers)
# Count how often outliers occur in each column (just for display purposes here, since LOF is multivariate)
outlier_frequencies = {}
for col in columns:
outlier_frequencies[col] = (outliers[col].notna()).sum()
# Display the outlier frequencies
print("\nOutlier frequencies per column:")
for col, freq in outlier_frequencies.items():
print(f"{col}: {freq} outliers")
DataFrame with LOF results:
temperature_2m_mean temperature_2m_max \
date
1980-01-08 04:00:00+00:00 23.374834 24.141499
1980-01-09 04:00:00+00:00 23.264421 23.891499
1980-01-10 04:00:00+00:00 22.322748 23.191502
1980-01-11 04:00:00+00:00 22.587332 23.341499
1980-01-12 04:00:00+00:00 21.306086 22.091499
... ... ...
2025-06-18 04:00:00+00:00 25.536499 26.148998
2025-06-19 04:00:00+00:00 25.476080 26.049000
2025-06-20 04:00:00+00:00 25.351082 26.199001
2025-06-21 04:00:00+00:00 25.390665 25.898998
2025-06-22 04:00:00+00:00 25.317749 25.898998
temperature_2m_min apparent_temperature_mean \
date
1980-01-08 04:00:00+00:00 22.191502 22.092840
1980-01-09 04:00:00+00:00 22.191502 22.358231
1980-01-10 04:00:00+00:00 21.341499 21.067259
1980-01-11 04:00:00+00:00 21.841499 19.905577
1980-01-12 04:00:00+00:00 20.541500 19.145449
... ... ...
2025-06-18 04:00:00+00:00 24.799000 24.706778
2025-06-19 04:00:00+00:00 24.749001 24.506287
2025-06-20 04:00:00+00:00 24.848999 25.104864
2025-06-21 04:00:00+00:00 24.699001 25.419016
2025-06-22 04:00:00+00:00 24.449001 24.848602
apparent_temperature_max apparent_temperature_min \
date
1980-01-08 04:00:00+00:00 23.520189 20.983297
1980-01-09 04:00:00+00:00 23.697132 21.602598
1980-01-10 04:00:00+00:00 22.371422 19.988932
1980-01-11 04:00:00+00:00 20.436180 18.984425
1980-01-12 04:00:00+00:00 19.637054 18.262983
... ... ...
2025-06-18 04:00:00+00:00 26.624802 23.568531
2025-06-19 04:00:00+00:00 25.319586 23.481770
2025-06-20 04:00:00+00:00 27.231419 23.766788
2025-06-21 04:00:00+00:00 27.573139 24.278919
2025-06-22 04:00:00+00:00 26.219694 23.004978
wind_speed_10m_max et0_fao_evapotranspiration \
date
1980-01-08 04:00:00+00:00 37.212578 3.982460
1980-01-09 04:00:00+00:00 36.896046 3.946293
1980-01-10 04:00:00+00:00 35.654541 3.259691
1980-01-11 04:00:00+00:00 42.072281 4.604709
1980-01-12 04:00:00+00:00 40.104061 2.766571
... ... ...
2025-06-18 04:00:00+00:00 41.411346 5.478891
2025-06-19 04:00:00+00:00 44.003281 5.058496
2025-06-20 04:00:00+00:00 40.882591 4.981394
2025-06-21 04:00:00+00:00 38.166790 5.119689
2025-06-22 04:00:00+00:00 44.039349 5.130907
rain_sum dew_point_2m_max ... \
date ...
1980-01-08 04:00:00+00:00 1.5 20.241501 ...
1980-01-09 04:00:00+00:00 0.8 20.141499 ...
1980-01-10 04:00:00+00:00 2.7 20.141499 ...
1980-01-11 04:00:00+00:00 0.5 18.691502 ...
1980-01-12 04:00:00+00:00 5.7 19.341499 ...
... ... ... ...
2025-06-18 04:00:00+00:00 0.1 21.949001 ...
2025-06-19 04:00:00+00:00 0.3 22.299000 ...
2025-06-20 04:00:00+00:00 0.1 22.449001 ...
2025-06-21 04:00:00+00:00 0.0 22.049000 ...
2025-06-22 04:00:00+00:00 1.0 22.148998 ...
surface_pressure_min pressure_msl_max \
date
1980-01-08 04:00:00+00:00 980.577454 1019.299988
1980-01-09 04:00:00+00:00 981.443359 1019.900024
1980-01-10 04:00:00+00:00 980.805786 1019.599976
1980-01-11 04:00:00+00:00 980.355164 1019.099976
1980-01-12 04:00:00+00:00 978.976501 1017.799988
... ... ...
2025-06-18 04:00:00+00:00 981.819763 1019.000000
2025-06-19 04:00:00+00:00 981.603394 1018.500000
2025-06-20 04:00:00+00:00 981.255981 1018.700012
2025-06-21 04:00:00+00:00 980.240479 1018.700012
2025-06-22 04:00:00+00:00 979.411743 1017.500000
pressure_msl_min relative_humidity_2m_max \
date
1980-01-08 04:00:00+00:00 1016.099976 87.652779
1980-01-09 04:00:00+00:00 1016.900024 87.906815
1980-01-10 04:00:00+00:00 1016.299988 90.619431
1980-01-11 04:00:00+00:00 1015.900024 81.800613
1980-01-12 04:00:00+00:00 1014.599976 89.427284
... ... ...
2025-06-18 04:00:00+00:00 1017.000000 83.175407
2025-06-19 04:00:00+00:00 1016.799988 83.486443
2025-06-20 04:00:00+00:00 1016.500000 86.541199
2025-06-21 04:00:00+00:00 1015.400024 85.219734
2025-06-22 04:00:00+00:00 1014.500000 86.767601
relative_humidity_2m_min \
date
1980-01-08 04:00:00+00:00 70.725937
1980-01-09 04:00:00+00:00 73.156029
1980-01-10 04:00:00+00:00 71.578697
1980-01-11 04:00:00+00:00 61.149487
1980-01-12 04:00:00+00:00 78.321884
... ...
2025-06-18 04:00:00+00:00 69.789970
2025-06-19 04:00:00+00:00 72.850510
2025-06-20 04:00:00+00:00 70.866669
2025-06-21 04:00:00+00:00 72.591751
2025-06-22 04:00:00+00:00 72.591751
wet_bulb_temperature_2m_max \
date
1980-01-08 04:00:00+00:00 21.027277
1980-01-09 04:00:00+00:00 20.914402
1980-01-10 04:00:00+00:00 20.636232
1980-01-11 04:00:00+00:00 19.724335
1980-01-12 04:00:00+00:00 19.959215
... ...
2025-06-18 04:00:00+00:00 22.869625
2025-06-19 04:00:00+00:00 23.097523
2025-06-20 04:00:00+00:00 23.118631
2025-06-21 04:00:00+00:00 22.751518
2025-06-22 04:00:00+00:00 22.906918
wet_bulb_temperature_2m_min \
date
1980-01-08 04:00:00+00:00 20.169138
1980-01-09 04:00:00+00:00 20.337797
1980-01-10 04:00:00+00:00 18.998484
1980-01-11 04:00:00+00:00 17.843048
1980-01-12 04:00:00+00:00 19.202456
... ...
2025-06-18 04:00:00+00:00 21.824770
2025-06-19 04:00:00+00:00 22.261038
2025-06-20 04:00:00+00:00 21.683819
2025-06-21 04:00:00+00:00 22.099451
2025-06-22 04:00:00+00:00 21.904879
vapour_pressure_deficit_max \
date
1980-01-08 04:00:00+00:00 0.880710
1980-01-09 04:00:00+00:00 0.795568
1980-01-10 04:00:00+00:00 0.783625
1980-01-11 04:00:00+00:00 1.107534
1980-01-12 04:00:00+00:00 0.576288
... ...
2025-06-18 04:00:00+00:00 1.023919
2025-06-19 04:00:00+00:00 0.914724
2025-06-20 04:00:00+00:00 0.984500
2025-06-21 04:00:00+00:00 0.912614
2025-06-22 04:00:00+00:00 0.912614
soil_temperature_0_to_7cm_mean LOF
date
1980-01-08 04:00:00+00:00 24.816500 1
1980-01-09 04:00:00+00:00 24.729010 1
1980-01-10 04:00:00+00:00 24.678999 1
1980-01-11 04:00:00+00:00 24.629000 1
1980-01-12 04:00:00+00:00 24.578997 1
... ... ...
2025-06-18 04:00:00+00:00 26.257332 1
2025-06-19 04:00:00+00:00 26.226084 1
2025-06-20 04:00:00+00:00 26.217749 1
2025-06-21 04:00:00+00:00 26.238586 1
2025-06-22 04:00:00+00:00 26.267754 1
[16603 rows x 22 columns]
Outliers:
temperature_2m_mean temperature_2m_max \
date
1984-11-09 04:00:00+00:00 23.899836 24.491501
1984-11-10 04:00:00+00:00 23.629000 24.191502
1984-12-16 04:00:00+00:00 22.676918 22.991501
1984-12-17 04:00:00+00:00 22.058168 22.491501
1985-03-06 04:00:00+00:00 21.545670 22.491501
... ... ...
2024-02-10 04:00:00+00:00 22.682335 23.549000
2024-06-24 04:00:00+00:00 26.503166 27.598999
2024-07-08 04:00:00+00:00 25.876083 26.999001
2024-07-13 04:00:00+00:00 26.430250 27.249001
2025-04-05 04:00:00+00:00 22.586496 24.049000
temperature_2m_min apparent_temperature_mean \
date
1984-11-09 04:00:00+00:00 22.941502 24.458384
1984-11-10 04:00:00+00:00 22.641499 23.820053
1984-12-16 04:00:00+00:00 22.391499 22.100792
1984-12-17 04:00:00+00:00 21.491501 21.047369
1985-03-06 04:00:00+00:00 20.491501 20.090147
... ... ...
2024-02-10 04:00:00+00:00 21.449001 21.389212
2024-06-24 04:00:00+00:00 23.699001 27.134459
2024-07-08 04:00:00+00:00 24.848999 26.651270
2024-07-13 04:00:00+00:00 24.598999 27.516558
2025-04-05 04:00:00+00:00 21.598999 20.925303
apparent_temperature_max apparent_temperature_min \
date
1984-11-09 04:00:00+00:00 26.100151 22.819016
1984-11-10 04:00:00+00:00 25.583670 22.599895
1984-12-16 04:00:00+00:00 23.416800 21.296741
1984-12-17 04:00:00+00:00 21.927549 19.866741
1985-03-06 04:00:00+00:00 21.959633 18.182652
... ... ...
2024-02-10 04:00:00+00:00 21.765350 20.778839
2024-06-24 04:00:00+00:00 28.154533 21.732914
2024-07-08 04:00:00+00:00 28.911919 24.626991
2024-07-13 04:00:00+00:00 29.844009 24.726059
2025-04-05 04:00:00+00:00 22.513348 19.282921
wind_speed_10m_max et0_fao_evapotranspiration \
date
1984-11-09 04:00:00+00:00 33.466450 3.518929
1984-11-10 04:00:00+00:00 27.238943 4.253847
1984-12-16 04:00:00+00:00 27.067116 3.873103
1984-12-17 04:00:00+00:00 29.070974 3.718415
1985-03-06 04:00:00+00:00 36.721764 3.856235
... ... ...
2024-02-10 04:00:00+00:00 37.226505 2.653172
2024-06-24 04:00:00+00:00 49.184483 3.909544
2024-07-08 04:00:00+00:00 38.647640 3.101306
2024-07-13 04:00:00+00:00 47.428551 4.825108
2025-04-05 04:00:00+00:00 41.330288 3.863715
rain_sum dew_point_2m_max ... \
date ...
1984-11-09 04:00:00+00:00 5.599999 21.741501 ...
1984-11-10 04:00:00+00:00 0.400000 19.641499 ...
1984-12-16 04:00:00+00:00 0.400000 18.691502 ...
1984-12-17 04:00:00+00:00 0.100000 18.241501 ...
1985-03-06 04:00:00+00:00 5.800000 18.841499 ...
... ... ... ...
2024-02-10 04:00:00+00:00 5.600000 19.348999 ...
2024-06-24 04:00:00+00:00 1.100000 23.549000 ...
2024-07-08 04:00:00+00:00 5.600000 23.848999 ...
2024-07-13 04:00:00+00:00 0.200000 23.949001 ...
2025-04-05 04:00:00+00:00 9.000000 20.199001 ...
surface_pressure_min pressure_msl_max \
date
1984-11-09 04:00:00+00:00 969.373352 1008.299988
1984-11-10 04:00:00+00:00 971.259155 1009.400024
1984-12-16 04:00:00+00:00 971.710815 1009.500000
1984-12-17 04:00:00+00:00 971.748718 1011.200012
1985-03-06 04:00:00+00:00 976.227295 1014.799988
... ... ...
2024-02-10 04:00:00+00:00 979.499207 1019.400024
2024-06-24 04:00:00+00:00 978.435974 1016.900024
2024-07-08 04:00:00+00:00 978.742493 1017.099976
2024-07-13 04:00:00+00:00 980.770020 1018.200012
2025-04-05 04:00:00+00:00 980.800781 1019.799988
pressure_msl_min relative_humidity_2m_max \
date
1984-11-09 04:00:00+00:00 1004.400024 91.277199
1984-11-10 04:00:00+00:00 1006.299988 81.392090
1984-12-16 04:00:00+00:00 1006.900024 78.879066
1984-12-17 04:00:00+00:00 1007.000000 80.762741
1985-03-06 04:00:00+00:00 1011.799988 85.932579
... ... ...
2024-02-10 04:00:00+00:00 1015.000000 84.616859
2024-06-24 04:00:00+00:00 1013.400024 92.999802
2024-07-08 04:00:00+00:00 1013.700012 86.671631
2024-07-13 04:00:00+00:00 1015.799988 89.528587
2025-04-05 04:00:00+00:00 1016.299988 86.528656
relative_humidity_2m_min \
date
1984-11-09 04:00:00+00:00 75.489159
1984-11-10 04:00:00+00:00 73.486008
1984-12-16 04:00:00+00:00 71.380264
1984-12-17 04:00:00+00:00 64.753418
1985-03-06 04:00:00+00:00 77.123642
... ...
2024-02-10 04:00:00+00:00 74.932289
2024-06-24 04:00:00+00:00 74.461945
2024-07-08 04:00:00+00:00 81.441521
2024-07-13 04:00:00+00:00 78.563728
2025-04-05 04:00:00+00:00 73.641228
wet_bulb_temperature_2m_max \
date
1984-11-09 04:00:00+00:00 22.308134
1984-11-10 04:00:00+00:00 20.816519
1984-12-16 04:00:00+00:00 19.954479
1984-12-17 04:00:00+00:00 19.500244
1985-03-06 04:00:00+00:00 19.741110
... ...
2024-02-10 04:00:00+00:00 20.353031
2024-06-24 04:00:00+00:00 24.447775
2024-07-08 04:00:00+00:00 24.511969
2024-07-13 04:00:00+00:00 24.641516
2025-04-05 04:00:00+00:00 20.940470
wet_bulb_temperature_2m_min \
date
1984-11-09 04:00:00+00:00 20.750200
1984-11-10 04:00:00+00:00 19.676367
1984-12-16 04:00:00+00:00 19.112356
1984-12-17 04:00:00+00:00 17.622261
1985-03-06 04:00:00+00:00 18.701254
... ...
2024-02-10 04:00:00+00:00 19.510832
2024-06-24 04:00:00+00:00 22.810537
2024-07-08 04:00:00+00:00 22.829180
2024-07-13 04:00:00+00:00 23.153746
2025-04-05 04:00:00+00:00 19.390827
vapour_pressure_deficit_max \
date
1984-11-09 04:00:00+00:00 0.730756
1984-11-10 04:00:00+00:00 0.773741
1984-12-16 04:00:00+00:00 0.796286
1984-12-17 04:00:00+00:00 0.942908
1985-03-06 04:00:00+00:00 0.613719
... ...
2024-02-10 04:00:00+00:00 0.725367
2024-06-24 04:00:00+00:00 0.942223
2024-07-08 04:00:00+00:00 0.661041
2024-07-13 04:00:00+00:00 0.774324
2025-04-05 04:00:00+00:00 0.788585
soil_temperature_0_to_7cm_mean LOF
date
1984-11-09 04:00:00+00:00 24.966499 -1
1984-11-10 04:00:00+00:00 25.066500 -1
1984-12-16 04:00:00+00:00 24.279005 -1
1984-12-17 04:00:00+00:00 24.229006 -1
1985-03-06 04:00:00+00:00 22.903999 -1
... ... ...
2024-02-10 04:00:00+00:00 25.515665 -1
2024-06-24 04:00:00+00:00 27.853163 -1
2024-07-08 04:00:00+00:00 27.490671 -1
2024-07-13 04:00:00+00:00 27.588583 -1
2025-04-05 04:00:00+00:00 25.355246 -1
[85 rows x 22 columns]
Outlier frequencies per column:
temperature_2m_mean: 85 outliers
temperature_2m_max: 85 outliers
temperature_2m_min: 85 outliers
apparent_temperature_mean: 85 outliers
apparent_temperature_max: 85 outliers
apparent_temperature_min: 85 outliers
wind_speed_10m_max: 85 outliers
et0_fao_evapotranspiration: 85 outliers
rain_sum: 85 outliers
dew_point_2m_max: 85 outliers
dew_point_2m_min: 85 outliers
surface_pressure_max: 85 outliers
surface_pressure_min: 85 outliers
pressure_msl_max: 85 outliers
pressure_msl_min: 85 outliers
relative_humidity_2m_max: 85 outliers
relative_humidity_2m_min: 85 outliers
wet_bulb_temperature_2m_max: 85 outliers
wet_bulb_temperature_2m_min: 85 outliers
vapour_pressure_deficit_max: 85 outliers
soil_temperature_0_to_7cm_mean: 85 outliers
Data Preservation: The first step involves creating a copy of the original dataframe, NYC_cntrprk_meteo_data, and storing it in a new dataframe named outlier_detect. This ensures that any modifications made to outlier_detect will not affect the original data.
LOF Model Application: The Local Outlier Factor (LOF) algorithm is then applied to outlier_detect. LOF is a statistical method used to identify outliers in a dataset by comparing the local density of a data point to the density of its neighbors. The results of the LOF analysis are stored in a new column named "LOF" within outlier_detect.
Outlier Flagging: The LOF column is used to flag outliers and inliers. Outliers are typically assigned a value of -1, while inliers are assigned a value of 1. This binary classification system simplifies the process of identifying and analyzing anomalous data points.
Outlier Frequency Calculation: The code calculates the frequency of outliers for each column in a specified list. This information can be valuable for understanding which columns are more prone to outliers and for determining appropriate outlier handling strategies.
Benefits of This Approach: By using outlier_detect as a working dataframe, the original structure of the dataset is preserved. This means that the original columns and their corresponding data types remain unchanged, ensuring that subsequent analyses and visualizations are based on the original data.
Normalizing the data to prevent differences in scale from affecting density calculations.
from scipy.stats import zscore
# Z-score normalization
goody_frame_normalized = goody_frame.copy()
goody_frame_normalized[columns] = goody_frame[columns].apply(zscore)
# Create a copy with a different name
goody_frame_zscore = goody_frame_normalized.copy()
print(goody_frame_zscore[columns])
temperature_2m_mean temperature_2m_max \
date
1980-01-08 04:00:00+00:00 -0.784183 -0.660480
1980-01-09 04:00:00+00:00 -0.880507 -0.844319
1980-01-10 04:00:00+00:00 -1.702019 -1.359068
1980-01-11 04:00:00+00:00 -1.471197 -1.248766
1980-01-12 04:00:00+00:00 -2.588952 -2.167964
... ... ...
2025-06-18 04:00:00+00:00 1.101646 0.815751
2025-06-19 04:00:00+00:00 1.048937 0.742217
2025-06-20 04:00:00+00:00 0.939889 0.852522
2025-06-21 04:00:00+00:00 0.974421 0.631912
2025-06-22 04:00:00+00:00 0.910809 0.631912
temperature_2m_min apparent_temperature_mean \
date
1980-01-08 04:00:00+00:00 -1.056235 -1.186370
1980-01-09 04:00:00+00:00 -1.056235 -1.065703
1980-01-10 04:00:00+00:00 -1.811791 -1.652678
1980-01-11 04:00:00+00:00 -1.367347 -2.180867
1980-01-12 04:00:00+00:00 -2.522900 -2.526479
... ... ...
2025-06-18 04:00:00+00:00 1.261536 0.002126
2025-06-19 04:00:00+00:00 1.217093 -0.089033
2025-06-20 04:00:00+00:00 1.305980 0.183126
2025-06-21 04:00:00+00:00 1.172649 0.325964
2025-06-22 04:00:00+00:00 0.950427 0.066610
apparent_temperature_max apparent_temperature_min \
date
1980-01-08 04:00:00+00:00 -1.182242 -1.125872
1980-01-09 04:00:00+00:00 -1.112033 -0.831592
1980-01-10 04:00:00+00:00 -1.638058 -1.598375
1980-01-11 04:00:00+00:00 -2.405936 -2.075698
1980-01-12 04:00:00+00:00 -2.723019 -2.418513
... ... ...
2025-06-18 04:00:00+00:00 0.049628 0.102582
2025-06-19 04:00:00+00:00 -0.468265 0.061354
2025-06-20 04:00:00+00:00 0.290326 0.196790
2025-06-21 04:00:00+00:00 0.425916 0.440145
2025-06-22 04:00:00+00:00 -0.111113 -0.165208
wind_speed_10m_max et0_fao_evapotranspiration \
date
1980-01-08 04:00:00+00:00 0.981832 -0.657839
1980-01-09 04:00:00+00:00 0.933995 -0.706750
1980-01-10 04:00:00+00:00 0.746366 -1.635291
1980-01-11 04:00:00+00:00 1.716280 0.183672
1980-01-12 04:00:00+00:00 1.418823 -2.302173
... ... ...
2025-06-18 04:00:00+00:00 1.616393 1.365891
2025-06-19 04:00:00+00:00 2.008112 0.797360
2025-06-20 04:00:00+00:00 1.536482 0.693091
2025-06-21 04:00:00+00:00 1.126042 0.880116
2025-06-22 04:00:00+00:00 2.013563 0.895287
rain_sum dew_point_2m_max ... \
date ...
1980-01-08 04:00:00+00:00 -0.121174 -0.434958 ...
1980-01-09 04:00:00+00:00 -0.265892 -0.502522 ...
1980-01-10 04:00:00+00:00 0.126914 -0.502522 ...
1980-01-11 04:00:00+00:00 -0.327914 -1.482185 ...
1980-01-12 04:00:00+00:00 0.747134 -1.043027 ...
... ... ... ...
2025-06-18 04:00:00+00:00 -0.410610 0.718683 ...
2025-06-19 04:00:00+00:00 -0.369262 0.955153 ...
2025-06-20 04:00:00+00:00 -0.410610 1.056498 ...
2025-06-21 04:00:00+00:00 -0.431284 0.786245 ...
2025-06-22 04:00:00+00:00 -0.224544 0.853807 ...
surface_pressure_max surface_pressure_min \
date
1980-01-08 04:00:00+00:00 1.417812 1.206291
1980-01-09 04:00:00+00:00 1.747181 1.661176
1980-01-10 04:00:00+00:00 1.482645 1.326241
1980-01-11 04:00:00+00:00 1.296188 1.089516
1980-01-12 04:00:00+00:00 0.481424 0.365267
... ... ...
2025-06-18 04:00:00+00:00 1.371331 1.858911
2025-06-19 04:00:00+00:00 1.142164 1.745246
2025-06-20 04:00:00+00:00 1.260285 1.562741
2025-06-21 04:00:00+00:00 1.171828 1.029269
2025-06-22 04:00:00+00:00 0.611156 0.593911
pressure_msl_max pressure_msl_min \
date
1980-01-08 04:00:00+00:00 1.446653 1.296101
1980-01-09 04:00:00+00:00 1.758363 1.696707
1980-01-10 04:00:00+00:00 1.602492 1.396253
1980-01-11 04:00:00+00:00 1.342750 1.195980
1980-01-12 04:00:00+00:00 0.667428 0.545011
... ... ...
2025-06-18 04:00:00+00:00 1.290815 1.746768
2025-06-19 04:00:00+00:00 1.031073 1.646616
2025-06-20 04:00:00+00:00 1.134976 1.496404
2025-06-21 04:00:00+00:00 1.134976 0.945617
2025-06-22 04:00:00+00:00 0.511589 0.494951
relative_humidity_2m_max relative_humidity_2m_min \
date
1980-01-08 04:00:00+00:00 0.608421 -0.313262
1980-01-09 04:00:00+00:00 0.659651 0.043225
1980-01-10 04:00:00+00:00 1.206694 -0.188165
1980-01-11 04:00:00+00:00 -0.571764 -1.718097
1980-01-12 04:00:00+00:00 0.966278 0.801040
... ... ...
2025-06-18 04:00:00+00:00 -0.294514 -0.450565
2025-06-19 04:00:00+00:00 -0.231789 -0.001594
2025-06-20 04:00:00+00:00 0.384252 -0.292617
2025-06-21 04:00:00+00:00 0.117758 -0.039553
2025-06-22 04:00:00+00:00 0.429910 -0.039553
wet_bulb_temperature_2m_max \
date
1980-01-08 04:00:00+00:00 -0.619876
1980-01-09 04:00:00+00:00 -0.707428
1980-01-10 04:00:00+00:00 -0.923191
1980-01-11 04:00:00+00:00 -1.630507
1980-01-12 04:00:00+00:00 -1.448321
... ...
2025-06-18 04:00:00+00:00 0.809145
2025-06-19 04:00:00+00:00 0.985915
2025-06-20 04:00:00+00:00 1.002288
2025-06-21 04:00:00+00:00 0.717536
2025-06-22 04:00:00+00:00 0.838071
wet_bulb_temperature_2m_min \
date
1980-01-08 04:00:00+00:00 -0.556888
1980-01-09 04:00:00+00:00 -0.436827
1980-01-10 04:00:00+00:00 -1.390226
1980-01-11 04:00:00+00:00 -2.212730
1980-01-12 04:00:00+00:00 -1.245027
... ...
2025-06-18 04:00:00+00:00 0.621683
2025-06-19 04:00:00+00:00 0.932243
2025-06-20 04:00:00+00:00 0.521346
2025-06-21 04:00:00+00:00 0.817217
2025-06-22 04:00:00+00:00 0.678709
vapour_pressure_deficit_max \
date
1980-01-08 04:00:00+00:00 0.067543
1980-01-09 04:00:00+00:00 -0.271330
1980-01-10 04:00:00+00:00 -0.318863
1980-01-11 04:00:00+00:00 0.970315
1980-01-12 04:00:00+00:00 -1.144075
... ...
2025-06-18 04:00:00+00:00 0.637523
2025-06-19 04:00:00+00:00 0.202919
2025-06-20 04:00:00+00:00 0.480632
2025-06-21 04:00:00+00:00 0.194522
2025-06-22 04:00:00+00:00 0.194522
soil_temperature_0_to_7cm_mean
date
1980-01-08 04:00:00+00:00 -0.691371
1980-01-09 04:00:00+00:00 -0.741405
1980-01-10 04:00:00+00:00 -0.770006
1980-01-11 04:00:00+00:00 -0.798600
1980-01-12 04:00:00+00:00 -0.827197
... ...
2025-06-18 04:00:00+00:00 0.132629
2025-06-19 04:00:00+00:00 0.114758
2025-06-20 04:00:00+00:00 0.109992
2025-06-21 04:00:00+00:00 0.121908
2025-06-22 04:00:00+00:00 0.138589
[16603 rows x 21 columns]
from sklearn.neighbors import LocalOutlierFactor
# Initialize the LOF model
lof = LocalOutlierFactor(n_neighbors=30)
# Create a copy of the original dataframe to avoid modifying it
# Fit the model and predict outliers on the selected columns
outlier_detect = goody_frame_zscore.copy()
# Add the LOF column with outlier predictions (-1 for outliers, 1 for inliers)
outlier_detect['LOF'] = lof.fit_predict(outlier_detect[columns])
# Extract the rows where outliers are identified (LOF == -1)
outliers = outlier_detect[outlier_detect['LOF'] == -1]
# Check LOF class counts
counts = outlier_detect['LOF'].value_counts()
print("LOF\n", counts)
# Display the original DataFrame with the new LOF column and the outliers
print("DataFrame with LOF results:")
print(outlier_detect)
print("\nOutliers:")
print(outliers)
# Count the number of outliers for each column
outlier_frequencies = {}
for col in columns:
# Count how often outliers occur in each column
outlier_frequencies[col] = (outliers[col].notna()).sum()
# Display the outlier frequencies
print("\nOutlier frequencies per column:")
for col, freq in outlier_frequencies.items():
print(f"{col}: {freq} outliers")
LOF
LOF
1 16491
-1 112
Name: count, dtype: int64
DataFrame with LOF results:
temperature_2m_mean temperature_2m_max \
date
1980-01-08 04:00:00+00:00 -0.784183 -0.660480
1980-01-09 04:00:00+00:00 -0.880507 -0.844319
1980-01-10 04:00:00+00:00 -1.702019 -1.359068
1980-01-11 04:00:00+00:00 -1.471197 -1.248766
1980-01-12 04:00:00+00:00 -2.588952 -2.167964
... ... ...
2025-06-18 04:00:00+00:00 1.101646 0.815751
2025-06-19 04:00:00+00:00 1.048937 0.742217
2025-06-20 04:00:00+00:00 0.939889 0.852522
2025-06-21 04:00:00+00:00 0.974421 0.631912
2025-06-22 04:00:00+00:00 0.910809 0.631912
temperature_2m_min apparent_temperature_mean \
date
1980-01-08 04:00:00+00:00 -1.056235 -1.186370
1980-01-09 04:00:00+00:00 -1.056235 -1.065703
1980-01-10 04:00:00+00:00 -1.811791 -1.652678
1980-01-11 04:00:00+00:00 -1.367347 -2.180867
1980-01-12 04:00:00+00:00 -2.522900 -2.526479
... ... ...
2025-06-18 04:00:00+00:00 1.261536 0.002126
2025-06-19 04:00:00+00:00 1.217093 -0.089033
2025-06-20 04:00:00+00:00 1.305980 0.183126
2025-06-21 04:00:00+00:00 1.172649 0.325964
2025-06-22 04:00:00+00:00 0.950427 0.066610
apparent_temperature_max apparent_temperature_min \
date
1980-01-08 04:00:00+00:00 -1.182242 -1.125872
1980-01-09 04:00:00+00:00 -1.112033 -0.831592
1980-01-10 04:00:00+00:00 -1.638058 -1.598375
1980-01-11 04:00:00+00:00 -2.405936 -2.075698
1980-01-12 04:00:00+00:00 -2.723019 -2.418513
... ... ...
2025-06-18 04:00:00+00:00 0.049628 0.102582
2025-06-19 04:00:00+00:00 -0.468265 0.061354
2025-06-20 04:00:00+00:00 0.290326 0.196790
2025-06-21 04:00:00+00:00 0.425916 0.440145
2025-06-22 04:00:00+00:00 -0.111113 -0.165208
wind_speed_10m_max et0_fao_evapotranspiration \
date
1980-01-08 04:00:00+00:00 0.981832 -0.657839
1980-01-09 04:00:00+00:00 0.933995 -0.706750
1980-01-10 04:00:00+00:00 0.746366 -1.635291
1980-01-11 04:00:00+00:00 1.716280 0.183672
1980-01-12 04:00:00+00:00 1.418823 -2.302173
... ... ...
2025-06-18 04:00:00+00:00 1.616393 1.365891
2025-06-19 04:00:00+00:00 2.008112 0.797360
2025-06-20 04:00:00+00:00 1.536482 0.693091
2025-06-21 04:00:00+00:00 1.126042 0.880116
2025-06-22 04:00:00+00:00 2.013563 0.895287
rain_sum dew_point_2m_max ... \
date ...
1980-01-08 04:00:00+00:00 -0.121174 -0.434958 ...
1980-01-09 04:00:00+00:00 -0.265892 -0.502522 ...
1980-01-10 04:00:00+00:00 0.126914 -0.502522 ...
1980-01-11 04:00:00+00:00 -0.327914 -1.482185 ...
1980-01-12 04:00:00+00:00 0.747134 -1.043027 ...
... ... ... ...
2025-06-18 04:00:00+00:00 -0.410610 0.718683 ...
2025-06-19 04:00:00+00:00 -0.369262 0.955153 ...
2025-06-20 04:00:00+00:00 -0.410610 1.056498 ...
2025-06-21 04:00:00+00:00 -0.431284 0.786245 ...
2025-06-22 04:00:00+00:00 -0.224544 0.853807 ...
surface_pressure_min pressure_msl_max \
date
1980-01-08 04:00:00+00:00 1.206291 1.446653
1980-01-09 04:00:00+00:00 1.661176 1.758363
1980-01-10 04:00:00+00:00 1.326241 1.602492
1980-01-11 04:00:00+00:00 1.089516 1.342750
1980-01-12 04:00:00+00:00 0.365267 0.667428
... ... ...
2025-06-18 04:00:00+00:00 1.858911 1.290815
2025-06-19 04:00:00+00:00 1.745246 1.031073
2025-06-20 04:00:00+00:00 1.562741 1.134976
2025-06-21 04:00:00+00:00 1.029269 1.134976
2025-06-22 04:00:00+00:00 0.593911 0.511589
pressure_msl_min relative_humidity_2m_max \
date
1980-01-08 04:00:00+00:00 1.296101 0.608421
1980-01-09 04:00:00+00:00 1.696707 0.659651
1980-01-10 04:00:00+00:00 1.396253 1.206694
1980-01-11 04:00:00+00:00 1.195980 -0.571764
1980-01-12 04:00:00+00:00 0.545011 0.966278
... ... ...
2025-06-18 04:00:00+00:00 1.746768 -0.294514
2025-06-19 04:00:00+00:00 1.646616 -0.231789
2025-06-20 04:00:00+00:00 1.496404 0.384252
2025-06-21 04:00:00+00:00 0.945617 0.117758
2025-06-22 04:00:00+00:00 0.494951 0.429910
relative_humidity_2m_min \
date
1980-01-08 04:00:00+00:00 -0.313262
1980-01-09 04:00:00+00:00 0.043225
1980-01-10 04:00:00+00:00 -0.188165
1980-01-11 04:00:00+00:00 -1.718097
1980-01-12 04:00:00+00:00 0.801040
... ...
2025-06-18 04:00:00+00:00 -0.450565
2025-06-19 04:00:00+00:00 -0.001594
2025-06-20 04:00:00+00:00 -0.292617
2025-06-21 04:00:00+00:00 -0.039553
2025-06-22 04:00:00+00:00 -0.039553
wet_bulb_temperature_2m_max \
date
1980-01-08 04:00:00+00:00 -0.619876
1980-01-09 04:00:00+00:00 -0.707428
1980-01-10 04:00:00+00:00 -0.923191
1980-01-11 04:00:00+00:00 -1.630507
1980-01-12 04:00:00+00:00 -1.448321
... ...
2025-06-18 04:00:00+00:00 0.809145
2025-06-19 04:00:00+00:00 0.985915
2025-06-20 04:00:00+00:00 1.002288
2025-06-21 04:00:00+00:00 0.717536
2025-06-22 04:00:00+00:00 0.838071
wet_bulb_temperature_2m_min \
date
1980-01-08 04:00:00+00:00 -0.556888
1980-01-09 04:00:00+00:00 -0.436827
1980-01-10 04:00:00+00:00 -1.390226
1980-01-11 04:00:00+00:00 -2.212730
1980-01-12 04:00:00+00:00 -1.245027
... ...
2025-06-18 04:00:00+00:00 0.621683
2025-06-19 04:00:00+00:00 0.932243
2025-06-20 04:00:00+00:00 0.521346
2025-06-21 04:00:00+00:00 0.817217
2025-06-22 04:00:00+00:00 0.678709
vapour_pressure_deficit_max \
date
1980-01-08 04:00:00+00:00 0.067543
1980-01-09 04:00:00+00:00 -0.271330
1980-01-10 04:00:00+00:00 -0.318863
1980-01-11 04:00:00+00:00 0.970315
1980-01-12 04:00:00+00:00 -1.144075
... ...
2025-06-18 04:00:00+00:00 0.637523
2025-06-19 04:00:00+00:00 0.202919
2025-06-20 04:00:00+00:00 0.480632
2025-06-21 04:00:00+00:00 0.194522
2025-06-22 04:00:00+00:00 0.194522
soil_temperature_0_to_7cm_mean LOF
date
1980-01-08 04:00:00+00:00 -0.691371 1
1980-01-09 04:00:00+00:00 -0.741405 1
1980-01-10 04:00:00+00:00 -0.770006 1
1980-01-11 04:00:00+00:00 -0.798600 1
1980-01-12 04:00:00+00:00 -0.827197 1
... ... ...
2025-06-18 04:00:00+00:00 0.132629 1
2025-06-19 04:00:00+00:00 0.114758 1
2025-06-20 04:00:00+00:00 0.109992 1
2025-06-21 04:00:00+00:00 0.121908 1
2025-06-22 04:00:00+00:00 0.138589 1
[16603 rows x 22 columns]
Outliers:
temperature_2m_mean temperature_2m_max \
date
1982-02-24 04:00:00+00:00 -2.180018 -1.947357
1984-04-30 04:00:00+00:00 -2.038255 -1.579678
1984-07-25 04:00:00+00:00 -0.297094 -0.219263
1984-12-16 04:00:00+00:00 -1.393043 -1.506140
1985-07-17 04:00:00+00:00 -0.295277 -0.072191
... ... ...
2024-07-03 04:00:00+00:00 1.977677 1.955559
2024-07-13 04:00:00+00:00 1.881351 1.624647
2025-03-20 04:00:00+00:00 0.138375 0.117163
2025-03-21 04:00:00+00:00 0.789039 1.073129
2025-06-11 04:00:00+00:00 1.188884 0.889289
temperature_2m_min apparent_temperature_mean \
date
1982-02-24 04:00:00+00:00 -1.722900 -2.621166
1984-04-30 04:00:00+00:00 -2.389566 -0.549201
1984-07-25 04:00:00+00:00 0.010427 -1.188954
1984-12-16 04:00:00+00:00 -0.878460 -1.182755
1985-07-17 04:00:00+00:00 -1.500679 -0.330040
... ... ...
2024-07-03 04:00:00+00:00 1.128202 0.302562
2024-07-13 04:00:00+00:00 1.083758 1.279666
2025-03-20 04:00:00+00:00 0.194871 1.457534
2025-03-21 04:00:00+00:00 0.594871 1.108831
2025-06-11 04:00:00+00:00 1.128202 -0.352042
apparent_temperature_max apparent_temperature_min \
date
1982-02-24 04:00:00+00:00 -1.727251 -2.895743
1984-04-30 04:00:00+00:00 -0.123619 -0.571303
1984-07-25 04:00:00+00:00 -1.544697 -1.016567
1984-12-16 04:00:00+00:00 -1.223266 -0.976929
1985-07-17 04:00:00+00:00 -0.222561 -1.134175
... ... ...
2024-07-03 04:00:00+00:00 0.192599 0.069899
2024-07-13 04:00:00+00:00 1.326968 0.652617
2025-03-20 04:00:00+00:00 1.346307 1.364430
2025-03-21 04:00:00+00:00 1.452034 1.049304
2025-06-11 04:00:00+00:00 -0.253851 -0.359380
wind_speed_10m_max et0_fao_evapotranspiration \
date
1982-02-24 04:00:00+00:00 1.087166 1.061252
1984-04-30 04:00:00+00:00 -2.034551 -1.022322
1984-07-25 04:00:00+00:00 3.467249 -1.611040
1984-12-16 04:00:00+00:00 -0.551453 -0.805730
1985-07-17 04:00:00+00:00 0.725451 -0.050931
... ... ...
2024-07-03 04:00:00+00:00 2.594206 1.301066
2024-07-13 04:00:00+00:00 2.525774 0.481734
2025-03-20 04:00:00+00:00 -3.117747 0.105352
2025-03-21 04:00:00+00:00 -0.357679 0.688249
2025-06-11 04:00:00+00:00 3.021438 1.828514
rain_sum dew_point_2m_max ... \
date ...
1982-02-24 04:00:00+00:00 -0.348588 -0.603865 ...
1984-04-30 04:00:00+00:00 1.553419 -0.705211 ...
1984-07-25 04:00:00+00:00 2.731837 0.848740 ...
1984-12-16 04:00:00+00:00 -0.348588 -1.482185 ...
1985-07-17 04:00:00+00:00 1.987573 0.274455 ...
... ... ... ...
2024-07-03 04:00:00+00:00 0.561068 1.765911 ...
2024-07-13 04:00:00+00:00 -0.389936 2.069945 ...
2025-03-20 04:00:00+00:00 -0.224544 0.279522 ...
2025-03-21 04:00:00+00:00 -0.389936 0.414648 ...
2025-06-11 04:00:00+00:00 -0.224544 0.617337 ...
surface_pressure_min pressure_msl_max \
date
1982-02-24 04:00:00+00:00 -0.288667 0.667428
1984-04-30 04:00:00+00:00 0.334069 0.511589
1984-07-25 04:00:00+00:00 -1.367635 -0.891024
1984-12-16 04:00:00+00:00 -3.451599 -3.644282
1985-07-17 04:00:00+00:00 0.699240 1.238879
... ... ...
2024-07-03 04:00:00+00:00 -0.464696 -0.527379
2024-07-13 04:00:00+00:00 1.307451 0.875234
2025-03-20 04:00:00+00:00 0.174329 0.563525
2025-03-21 04:00:00+00:00 0.920702 1.186912
2025-06-11 04:00:00+00:00 1.557514 1.498621
pressure_msl_min relative_humidity_2m_max \
date
1982-02-24 04:00:00+00:00 -0.155988 0.926413
1984-04-30 04:00:00+00:00 0.545011 1.311157
1984-07-25 04:00:00+00:00 -1.357745 1.180015
1984-12-16 04:00:00+00:00 -3.310561 -1.160941
1985-07-17 04:00:00+00:00 0.695253 1.723833
... ... ...
2024-07-03 04:00:00+00:00 -0.606654 1.027962
2024-07-13 04:00:00+00:00 1.145889 0.986708
2025-03-20 04:00:00+00:00 0.194527 -0.680470
2025-03-21 04:00:00+00:00 0.895526 0.156970
2025-06-11 04:00:00+00:00 1.446344 -0.091745
relative_humidity_2m_min \
date
1982-02-24 04:00:00+00:00 -2.597358
1984-04-30 04:00:00+00:00 0.710764
1984-07-25 04:00:00+00:00 1.176724
1984-12-16 04:00:00+00:00 -0.217274
1985-07-17 04:00:00+00:00 0.684803
... ...
2024-07-03 04:00:00+00:00 -0.477517
2024-07-13 04:00:00+00:00 0.836517
2025-03-20 04:00:00+00:00 -0.944173
2025-03-21 04:00:00+00:00 -1.055872
2025-06-11 04:00:00+00:00 -0.542302
wet_bulb_temperature_2m_max \
date
1982-02-24 04:00:00+00:00 -0.937965
1984-04-30 04:00:00+00:00 -1.053084
1984-07-25 04:00:00+00:00 0.695513
1984-12-16 04:00:00+00:00 -1.451995
1985-07-17 04:00:00+00:00 0.228041
... ...
2024-07-03 04:00:00+00:00 1.859969
2024-07-13 04:00:00+00:00 2.183517
2025-03-20 04:00:00+00:00 0.337831
2025-03-21 04:00:00+00:00 0.543531
2025-06-11 04:00:00+00:00 0.619203
wet_bulb_temperature_2m_min \
date
1982-02-24 04:00:00+00:00 -3.522482
1984-04-30 04:00:00+00:00 -0.993538
1984-07-25 04:00:00+00:00 0.698083
1984-12-16 04:00:00+00:00 -1.309165
1985-07-17 04:00:00+00:00 -0.228983
... ...
2024-07-03 04:00:00+00:00 1.333226
2024-07-13 04:00:00+00:00 1.567723
2025-03-20 04:00:00+00:00 -0.277850
2025-03-21 04:00:00+00:00 0.504016
2025-06-11 04:00:00+00:00 0.717358
vapour_pressure_deficit_max \
date
1982-02-24 04:00:00+00:00 1.237140
1984-04-30 04:00:00+00:00 -0.961819
1984-07-25 04:00:00+00:00 -1.093742
1984-12-16 04:00:00+00:00 -0.268469
1985-07-17 04:00:00+00:00 -0.616006
... ...
2024-07-03 04:00:00+00:00 1.013464
2024-07-13 04:00:00+00:00 -0.355880
2025-03-20 04:00:00+00:00 0.844324
2025-03-21 04:00:00+00:00 1.290921
2025-06-11 04:00:00+00:00 0.746506
soil_temperature_0_to_7cm_mean LOF
date
1982-02-24 04:00:00+00:00 -1.270411 -1
1984-04-30 04:00:00+00:00 -0.734257 -1
1984-07-25 04:00:00+00:00 -0.137349 -1
1984-12-16 04:00:00+00:00 -0.998759 -1
1985-07-17 04:00:00+00:00 -0.348237 -1
... ... ...
2024-07-03 04:00:00+00:00 0.933276 -1
2024-07-13 04:00:00+00:00 0.893960 -1
2025-03-20 04:00:00+00:00 0.146927 -1
2025-03-21 04:00:00+00:00 0.156459 -1
2025-06-11 04:00:00+00:00 0.329218 -1
[112 rows x 22 columns]
Outlier frequencies per column:
temperature_2m_mean: 112 outliers
temperature_2m_max: 112 outliers
temperature_2m_min: 112 outliers
apparent_temperature_mean: 112 outliers
apparent_temperature_max: 112 outliers
apparent_temperature_min: 112 outliers
wind_speed_10m_max: 112 outliers
et0_fao_evapotranspiration: 112 outliers
rain_sum: 112 outliers
dew_point_2m_max: 112 outliers
dew_point_2m_min: 112 outliers
surface_pressure_max: 112 outliers
surface_pressure_min: 112 outliers
pressure_msl_max: 112 outliers
pressure_msl_min: 112 outliers
relative_humidity_2m_max: 112 outliers
relative_humidity_2m_min: 112 outliers
wet_bulb_temperature_2m_max: 112 outliers
wet_bulb_temperature_2m_min: 112 outliers
vapour_pressure_deficit_max: 112 outliers
soil_temperature_0_to_7cm_mean: 112 outliers
The LOF algorithm offers a row-by-row analysis of the meteorological data to identify outliers. This approach involves assessing the local density of each data point relative to its neighbors.
LOF's Row-by-Row Analysis:
The LOF model operates on a row-by-row basis, examining each row of data independently. Each row in your dataset (e.g., a daily weather observation) is assessed individually; weather data recorded for a specific day at a particular location, having multiple attributes. It focuses on the values in the specified columns to assess the local density of each data point. By comparing the density of a row (data point) to its neighbors, the model can determine whether it's an outlier or an inlier.
Outlier Determination:
If a row's local density is significantly lower than its neighbors, LOF labels it as an outlier (-1). Conversely, rows with similar densities to their neighbors are labeled as inliers (1). This labeling is reflected in the LOF column of your outlier_detect DataFrame.
Temporal Context and Outlier Detection:
The columns you're analyzing (PRCP, SNOW, SNWD, TMIN, and TMAX) are associated with a Datetime column, indicating that the data points correspond to specific time points. This temporal context is essential to consider when interpreting outliers.
Seasonal Variations and Outlier Identification:
Weather data often exhibits seasonal patterns. Values that might be considered outliers in one season could be perfectly normal in another. LOF, while effective for general outlier detection, doesn't inherently account for these temporal relationships. It's highly likely that a 30 neighbor consideration will not misidentify.
Developing A Classification Model Based On LOF¶
The Local Outlier Factor (LOF) method provides a valuable tool for identifying anomalous data points within a dataset. By labeling outliers as -1 and inliers as 1, LOF effectively transforms the outlier detection problem into a binary classification task. This opens the door for the application of various classification algorithms, including logistic regression and support vector machine. One can assign the -1 instances as extreme events. Assuming the case of extreme events to have one day durations, extreme events would then fall into the following classes: extreme heat day, blizzard day, extreme cold day and day of flooding; the latter term depends on a terrain that encourages and sustains water elevation. Such extreme events identification is based on the attributes of PRCP, SNOW, SNWD, TMIN, TMAX in the dataset.
Logistic Regression, a popular statistical model, can be employed to predict the probability of a data point (event) belonging to a specific class (in this case, outlier or inlier to be identified with extreme events and normal weather). By using the LOF column as the target variable and selecting relevant features from the meteorological data, one can train logistic regression models to identify outliers based on the observed characteristics.To recall the logistic regression model is of the form:
$$P(Y = 1 | X) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k)}}$$where:
$$P(Y = 1 | X)\,\,\text {is the probability of the outcome being 1}.$$$$X\,\, \text {represents the vector of predictor variables} (X_1, X_2, \ldots, X_k)$$$$\beta_0\,\,\text {is the intercept.}$$$$(\beta_1, \beta_2, \ldots, \beta_k)\,\,\text {are the coefficients for the predictor variables.}$$$$(e)\,\,\text {is the standard exponential value.}$$The log-odds (logit) transformation of the model is given by:
$$\text{logit}(P) = \log\left(\frac{P(Y = 1 | X)}{1 - P(Y = 1 | X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k$$The performance of the logistic regression models can be evaluated using metrics such as accuracy, precision, recall, and the ROC-AUC score. These metrics provide insights into the model's ability to correctly classify outliers and inliers, as well as its sensitivity and specificity.
It's important to note that logistic regression offers a probabilistic approach to outlier detection, modeling the presence of outliers as a function of input features. This approach contrasts with Extreme Value Analysis (EVA), which focuses on the tail behavior of distributions.
Logistic regression can be sensitive to imbalanced classes. To create a balanced dataset in Python, you can employ several techniques, particularly if your data is imbalanced with respect to the target classes. One common approach is oversampling the minority class or undersampling the majority class. Will implement the latter.
Note: no feature selection to be done on the current "outlier_detect" dataframe because the attributes are significant or elementary by consensus concerning extreme weather.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils import resample
import pandas as pd
# Transform 'LOF' Column
outlier_detect_logit = outlier_detect.copy()
outlier_detect_logit['LOF'] = outlier_detect_logit['LOF'].replace({1: 0, -1: 1})
columns_features = outlier_detect_logit.drop(columns='LOF').columns
# Count the occurrences of each unique value in 'LOF'
counts = outlier_detect_logit['LOF'].value_counts()
print(counts)
# Separate features and target
X = outlier_detect_logit[columns_features]
y = outlier_detect_logit['LOF']
print("Overall class distribution:")
print(y.value_counts())
# If there are any instances of the minority class, proceed with the split
if (y == 1).sum() > 0:
# Train-Test Split with stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print("\nClass distribution in training set:")
print(y_train.value_counts())
print("\nClass distribution in test set:")
print(y_test.value_counts())
# If the minority class (1) is present in the training set, no need to resample
if (y_train == 1).sum() > 0:
X_train_balanced = X_train
y_train_balanced = y_train
else:
# Upsample the minority class before splitting
print("\nThe minority class is not present in the training set after splitting.")
print("Adjusting by upsampling the minority class before splitting...")
df_majority = outlier_detect[outlier_detect['LOF'] == 0]
df_minority = outlier_detect[outlier_detect['LOF'] == 1]
df_minority_upsampled = resample(df_minority,
replace=True,
n_samples=df_majority.shape[0],
random_state=42)
df_balanced = pd.concat([df_majority, df_minority_upsampled])
X = df_balanced[columns_features]
y = df_balanced['LOF']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
X_train_balanced = X_train
y_train_balanced = y_train
print("\nClass distribution in balanced training set:")
print(y_train_balanced.value_counts())
# Train the Logistic Regression Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train_balanced, y_train_balanced)
# Validate the model
y_pred = model.predict(X_test)
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:\n", conf_matrix)
# Classification Report
report = classification_report(y_test, y_pred)
print("\nClassification Report:\n", report)
# Display coefficients and intercept
print("\nCoefficients:\n", model.coef_)
print("Intercept:\n", model.intercept_)
else:
print("The dataset does not contain any instances of the minority class (1). Consider collecting more data or adjusting the dataset.")
LOF
0 16491
1 112
Name: count, dtype: int64
Overall class distribution:
LOF
0 16491
1 112
Name: count, dtype: int64
Class distribution in training set:
LOF
0 13192
1 90
Name: count, dtype: int64
Class distribution in test set:
LOF
0 3299
1 22
Name: count, dtype: int64
Confusion Matrix:
[[3298 1]
[ 19 3]]
Classification Report:
precision recall f1-score support
0 0.99 1.00 1.00 3299
1 0.75 0.14 0.23 22
accuracy 0.99 3321
macro avg 0.87 0.57 0.61 3321
weighted avg 0.99 0.99 0.99 3321
Coefficients:
[[-0.4142905 -1.32995903 -0.90096399 2.00923998 0.14820063 -1.12019417
0.68060291 -0.46578716 0.01631078 0.89305918 -0.86632205 -0.0371282
-0.38230594 0.55503656 -0.23326582 -0.20887113 -0.26927343 1.75853439
-0.50782609 0.30522694 0.54741553]]
Intercept:
[-6.08772099]
For the general LOF class distribution only about 0.67% of samples belong to the minority class (1). This imbalance has a significant impact on evaluation metrics. There is also high observed imbalance for both the training set and test set.
CONFUSION MATRIX:
True Negatives (TN): 3298
False Positives (FP): 1
False Negatives (FN): 19
True Positives (TP): 3
Hence, the model misses most of the rare class (high false negative rate), which is expected due to imbalance.
CLASSIFICATION REPORT:
Precision (class 1): 75% → When it predicts 1, it's correct 75% of the time.
Recall (class 1): 14% → It captures only 14% of the actual positives, which is quite poor.
F1-score (class 1): 0.23 → Harmonic mean of precision and recall, very low due to poor recall.
Macro avg: Average across classes, treating them equally.
Weighted avg: Takes class imbalance into account.
ACCURACY: 0.99 This is misleading in imbalanced problems. Predicting everything as class 0 would already give you ~99% accuracy.
COEFFICIENTS:
Features with larger magnitude coefficients have more influence.
Positive coefficients increase the log-odds of class 1.
E.g., feature #4 with weight 2.009 and feature #18 with 1.758 contribute most positively toward classifying as class 1.
The intercept of -6.09 implies a low base probability of class 1 before feature effects are added.
SUMMARY -- The model is accurate, but not useful for minority class detection:
Only 3 out of 22 actual class-1 samples were detected.
It may be heavily biased toward predicting class 0.
Observing the disproportionate orientation (whether general sample, training set or test set), SMOTE or undersample or other things likely will not be much helpful. The model generally isn't reliable.
Identifying the Rate of Outliers (Extreme Events) Based On LOF¶
Identifying Outliers Over Time¶
Visualizing the temporal distribution of outliers provides valuable insights into the underlying patterns and trends in a dataset. By examining how the number of outliers changes over time, you can identify seasonal variations, anomalies, or broader trends that may be indicative of underlying factors or changes in the data-generating process.
Key Observations
Seasonal Patterns: If the number of outliers consistently increases or decreases during specific months or seasons, it suggests a seasonal influence on the data. This could be due to factors such as weather patterns, economic cycles, or human behavior.
Anomalies: Outliers that deviate significantly from the expected seasonal patterns or overall trend might indicate unusual events or data errors. These anomalies could be further investigated to understand their underlying causes.
Trends: Observing increasing or decreasing trends in the number of outliers over time can reveal broader changes in the data-generating process. This might be indicative of shifts in underlying conditions, such as changes in climate, economic conditions, or technological advancements.
For the island of Montserrat will now observe outlier counts in the months of July and September. The two most influential weather months for Montserrat are July and September. July marks the start of the rainy season, which continues through November, and September is the wettest month. While Montserrat experiences warm, tropical weather year-round, these months significantly impact rainfall patterns and potential for hurricanes. To be accomplished with a trend model based on the "Prophet" algorithm.
from prophet import Prophet
outlier_detect.info()
outlier_detect_reset = outlier_detect.reset_index()
print(outlier_detect_reset.columns)
# Ensure 'date' column exists and is datetime
outlier_detect_reset['date'] = pd.to_datetime(outlier_detect_reset['date'])
# Extract month and year if not already done
outlier_detect_reset['month'] = outlier_detect_reset['date'].dt.month
outlier_detect_reset['year'] = outlier_detect_reset['date'].dt.year
# Filter for outliers in January and July only
july_sep_outliers = outlier_detect_reset[(goody_frame_zscore['LOF'] == 1) & (outlier_detect_reset['month'].isin([7, 9]))]
# Check if there are any rows in the filtered dataset
if july_sep_outliers.empty:
print("No outliers found for January and July. Please check the dataset or filtering criteria.")
else:
# Group by year and month to count outliers
outlier_counts = july_sep_outliers.groupby(['year', 'month']).size().reset_index(name='outlier_count')
# Create a 'date' column for Prophet, using the 'year' and 'month'
outlier_counts['date'] = pd.to_datetime(outlier_counts[['year', 'month']].assign(day=1))
# Prepare the data for Prophet (requires 'ds' and 'y' column names)
prophet_data = outlier_counts[['date', 'outlier_count']]
prophet_data.columns = ['ds', 'y']
# Check if prophet_data has at least two non-NaN rows
if prophet_data.dropna().shape[0] < 2:
print("The dataset has less than 2 non-NaN rows. Not enough data for Prophet model.")
else:
# Define and fit the Prophet model
model = Prophet()
model.fit(prophet_data)
# Create a DataFrame for future predictions (e.g., 24 months ahead to include future January and July)
future_dates = model.make_future_dataframe(periods=24, freq='ME')
# Forecast
forecast = model.predict(future_dates)
# Filter the forecasted data for only January and July
forecast_july_sep = forecast[(forecast['ds'].dt.month.isin([1, 7]))]
# Merge the actual and forecasted data for visualization
merged_data = pd.concat([prophet_data, forecast_july_sep[['ds', 'yhat']].rename(columns={'yhat': 'outlier_count'})])
plt.figure(figsize=(10, 6)) # Adjust the figure size to be wider and shorter
plt.plot(prophet_data['ds'], prophet_data['y'], marker='o', linestyle='-', label='Actual Outlier Count')
plt.plot(forecast_july_sep['ds'], forecast_july_sep['yhat'], marker='o', linestyle='--',
label='Forecasted Outlier Count', color='orange')
plt.xlabel('Date')
plt.ylabel('Outlier Count')
plt.title('Outlier Counts for January and July Over the Years with Forecast')
plt.legend()
plt.grid(True) # Optional: Adds grid lines for better readability
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 16603 entries, 1980-01-08 04:00:00+00:00 to 2025-06-22 04:00:00+00:00
Data columns (total 22 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 temperature_2m_mean 16603 non-null float32
1 temperature_2m_max 16603 non-null float32
2 temperature_2m_min 16603 non-null float32
3 apparent_temperature_mean 16603 non-null float32
4 apparent_temperature_max 16603 non-null float32
5 apparent_temperature_min 16603 non-null float32
6 wind_speed_10m_max 16603 non-null float32
7 et0_fao_evapotranspiration 16603 non-null float32
8 rain_sum 16603 non-null float32
9 dew_point_2m_max 16603 non-null float32
10 dew_point_2m_min 16603 non-null float32
11 surface_pressure_max 16603 non-null float32
12 surface_pressure_min 16603 non-null float32
13 pressure_msl_max 16603 non-null float32
14 pressure_msl_min 16603 non-null float32
15 relative_humidity_2m_max 16603 non-null float32
16 relative_humidity_2m_min 16603 non-null float32
17 wet_bulb_temperature_2m_max 16603 non-null float32
18 wet_bulb_temperature_2m_min 16603 non-null float32
19 vapour_pressure_deficit_max 16603 non-null float32
20 soil_temperature_0_to_7cm_mean 16603 non-null float32
21 LOF 16603 non-null int32
dtypes: float32(21), int32(1)
memory usage: 1.5 MB
Index(['date', 'temperature_2m_mean', 'temperature_2m_max',
'temperature_2m_min', 'apparent_temperature_mean',
'apparent_temperature_max', 'apparent_temperature_min',
'wind_speed_10m_max', 'et0_fao_evapotranspiration', 'rain_sum',
'dew_point_2m_max', 'dew_point_2m_min', 'surface_pressure_max',
'surface_pressure_min', 'pressure_msl_max', 'pressure_msl_min',
'relative_humidity_2m_max', 'relative_humidity_2m_min',
'wet_bulb_temperature_2m_max', 'wet_bulb_temperature_2m_min',
'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean', 'LOF'],
dtype='object')
No outliers found for January and July. Please check the dataset or filtering criteria.
C:\Users\verlene\AppData\Local\Temp\ipykernel_10952\891376500.py:16: UserWarning: Boolean Series key will be reindexed to match DataFrame index. july_sep_outliers = outlier_detect_reset[(goody_frame_zscore['LOF'] == 1) & (outlier_detect_reset['month'].isin([7, 9]))]
No outliers have been detected, hence will pursue an alternative outlier detector in the near future concerning outlier counts. Recalling from the LOF development, the outlier count was quite poor, hence such should be expected.
Outlier Detection Based Histogram¶
Histogram-Based Outlier Score (HBOS) is an efficient, univariate outlier detection method that works by analyzing the distribution of each feature (attribute) independently. Unlike algorithms like LOF that consider all attributes together in a multi-dimensional space, HBOS looks at each attribute separately, making it computationally efficient and suitable for high-dimensional datasets.
Mathematical Structure for steps in the HBOS Process¶
1. Construct the Histogram:
Given a dataset $X = {x_1,x_2,...,x_n}$, chhose a number of bins $k$ and compute the histogram:
$$H = {h_1,h_2,...,h_k}$$where $h_j$ represents the count (or frequency) of data points falling into the $j$-th bin.
2. Calculate Bin Width:
The width of each bin $w$ can be computed as:
$$w = \frac{\text{max}(X) - \text{min}(X)}{k}$$3. Estimate Probability Density:
The probability density function (PDF) for each bin:
$$p_j = \frac{h_j}{n\cdot\,w}$$where $p_j$ is the probability density of the $j$-th bin.
4. Outlier Score Calculation:
For each data point $x_i$, determine the bin $b_i$ it fall into. The outlier score $O(x_i)$ for that point can be identified as:
$$O(x_i) = \frac{1}{p_{b_i}}\,\,\,\,\,\,\text{if}\,\,\,p_{b_i}\,\,>\,\,0$$If $p_{b_i}=0$, namely, the bin is empty, assign a high score:
$$O(x_i) = \infty$$5. Normalization (optional):
Normalize the outlier scores to a designed range (say, 0 to 1) for interpretation:
$$O_{\text{norm}}(x_i)=\frac{O(x_i) - O_{\text{min}}}{O_{\text{max}}-O_{\text{min}}}$$where $O_{\text{min}}$ and $O_{\text{max}}$ are the minimum and maximum outlier scores across all data points.
6. Interpretation:
High Outlier Score indicates that the data point is rare or unusual compared to the distribution of the rest of the data.
Low Outlier Score: conveys that the data point is common and falls within the expected range of the data distribution.
The characteristics of HBOS:¶
Univariate Analysis:
HBOS treats each feature independently, meaning it evaluates the distribution of each attribute (e.g., temperature, humidity) separately rather than considering the interaction between multiple attributes.
Histogram Construction:
For each feature, HBOS constructs a histogram to approximate its distribution. The dataset is divided into a number of bins (intervals), and the frequency (or density) of data points in each bin is calculated.
The bins might be equidistant (fixed width) or variable in width, depending on the distribution of the data.
Calculating Outlier Scores:
After constructing histograms for each feature, HBOS assigns an outlier score for each value in the dataset based on the inverse density of the bin in which the value falls.
If a value falls into a bin with low density (fewer data points), it receives a higher outlier score, indicating it is an anomaly for that particular feature.
Conversely, if the value falls into a bin with high density, it gets a lower score, indicating it is more common.
Aggregating Scores Across Features:
Since HBOS evaluates each feature separately, it aggregates the outlier scores from all features to determine the final outlier score for each row.
Aggregation can be done in various ways, but a common approach is to multiply the outlier scores of each feature. This approach assumes independence between features, which may not always be true but keeps the method simple and efficient.
Contamination:
In the context of the HBOS (and other anomaly detection models), contamination is a parameter that represents the expected proportion of outliers in the dataset. It is a way for the algorithm to understand how many data points it should consider as outliers.
The parameter guides the model on how many data points should be classified as outliers. It helps the algorithm determine a threshold for the outlier scores to classify points as either outliers or inliers. When contamination = 0.05, it means the model assumes that 5% of the data are outliers.
The contamination is always a decimal; it's not the outlier (1) - inlier (0) classification.
The value of contamination resides from 0 to 1. A value like 0.05 (5%) indicates that you expect around 5% of your dataset to be anomalous.
A value of 0 (contamination score, not class) doesn't make sense in practice.
A value of 1 (contamination score, not class) would mean all points are outliers (also impractical).
- The Outlier Score, different to the contanimation has the general rule of thumb:
An outlier score from 0 and 0.5 conveys an inlier. Anything above such range to be dubbed an outlier; higher values beyond 0.5 indicate stronger devations from the distribution.
Steps in the Process¶
- Data Preparation
Will be similar to what was done with LOF.
- Normalize the Data
Normalize only the relevant numerical columns that will be used for HBOS (e.g., temperature, humidity, pressure, wind speed, etc.).
- HBOS Algorithm Implementation
Create a binary classification column called HBOS_class based on the results.
- Logistic Regression
Using the HBOS_class as the target, perform logistic regression to predict the probability of a row (or weather state) being classified as an outlier.
Demonstration with HBOS¶
To now demonstrate HBOS using histograms with conveyance of how contamination (the expected proportion of outliers) affects the results.
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from pyod.models.hbos import HBOS
from pyod.utils.data import generate_data
from pyod.utils.utility import standardizer
# Step 1: Generate Sample Data (replace this with your dataset if needed)
X_train, X_test, y_train, y_test = generate_data(
n_train=200, n_test=100, n_features=2, contamination=0.1, random_state=42
)
# Step 2: Standardize the data
X_train_norm, X_test_norm = standardizer(X_train, X_test)
# Step 3: Initialize HBOS with a contamination level (i.e., the expected fraction of outliers)
contamination = 0.1 # 10% expected outliers
hbos = HBOS(contamination=contamination)
# Step 4: Train the HBOS model
hbos.fit(X_train_norm)
# Step 5: Get Outlier Scores for the test data
y_test_scores = hbos.decision_function(X_test_norm) # higher scores indicate more abnormal
# Step 6: Predict outliers
y_test_pred = hbos.predict(X_test_norm) # 1 indicates an outlier, 0 indicates an inlier
# Step 7: Visualize the histograms for both features
fig, axs = plt.subplots(1, 2, figsize=(12, 5))
# Histogram for the first feature
axs[0].hist(X_train[:, 0], bins=20, color='lightblue', edgecolor='black')
axs[0].set_title('Histogram of Feature 1')
axs[0].set_xlabel('Feature 1 Values')
axs[0].set_ylabel('Frequency')
# Histogram for the second feature
axs[1].hist(X_train[:, 1], bins=20, color='lightgreen', edgecolor='black')
axs[1].set_title('Histogram of Feature 2')
axs[1].set_xlabel('Feature 2 Values')
axs[1].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
# Step 8: Visualize the HBOS results on a scatter plot, showing outliers
plt.figure(figsize=(8, 6))
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test_pred, cmap='coolwarm', marker='o', edgecolor='k')
plt.title('HBOS: Outlier Detection (Red = Outlier, Blue = Inlier)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Outlier/ Inlier')
plt.show()
print("Outlier Scores:", y_test_scores)
Outlier Scores: [1.636894 1.636894 3.25528788 1.636894 1.636894 0.35447232 0.64240801 1.34895831 0.35447232 0.35447232 1.34895831 2.21636348 0.35447232 2.21636348 2.21636348 1.34895831 2.26080189 0.35447232 1.9284278 0.35447232 1.636894 1.636894 0.64240801 3.34840259 0.35447232 2.26080189 2.21636348 1.9284278 0.35447232 0.35447232 2.26080189 0.64240801 3.25528788 0.35447232 1.9284278 0.35447232 0.35447232 1.34895831 0.64240801 3.2600372 2.21636348 2.21636348 0.64240801 0.64240801 0.35447232 1.34895831 0.35447232 0.35447232 3.83475736 0.64240801 0.64240801 0.64240801 1.9284278 1.34895831 0.35447232 0.35447232 0.64240801 1.9284278 0.64240801 0.35447232 1.636894 0.64240801 0.64240801 1.34895831 3.2600372 2.21636348 2.21636348 2.26080189 3.83475736 1.636894 0.35447232 0.64240801 0.35447232 3.2600372 0.35447232 1.9284278 0.35447232 3.25528788 0.35447232 0.64240801 1.9284278 0.35447232 0.35447232 0.64240801 4.25452319 3.83475736 0.64240801 2.21636348 0.35447232 0.35447232 3.60424531 3.16060346 3.60424531 6.03202404 5.96603178 5.96603178 5.86659804 6.35359539 3.97468511 6.13145778]
Now, to proceed with the Montserrat based data.
from sklearn.preprocessing import StandardScaler
from pyod.models.hbos import HBOS
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
HBOS_data = goody_frame.copy()
# Exclude columns for normalization
columns_to_exclude_normalization = ['date']
#Exclude columns from HBOS subjugation
columns_to_exclude_hbos = ['date']
# Get the columns to normalize and apply HBOS
columns_for_hbos = [col for col in HBOS_data.columns if col not in columns_to_exclude_hbos]
columns_for_normalization = [col for col in HBOS_data.columns if col not in columns_to_exclude_normalization]
# Normalize relevant columns
scaler = StandardScaler()
HBOS_data[columns_for_normalization] = scaler.fit_transform(HBOS_data[columns_for_normalization])
# Select the data for HBOS
X_hbos = HBOS_data[columns_for_hbos]
# Apply the HBOS algorithm
hbos = HBOS(contamination=0.05)
hbos.fit(X_hbos)
# Add the HBOS classification column
HBOS_data['HBOS_class'] = hbos.labels_ # 0 for inliers, 1 for outliers
# Show dataframe
print(HBOS_data)
# Step 5: Logistic Regression
# Define features (excluding 'HBOS_class') and the target ('HBOS_class')
X = HBOS_data[columns_for_hbos]
y = HBOS_data['HBOS_class']
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
temperature_2m_mean temperature_2m_max \
date
1980-01-08 04:00:00+00:00 -0.784182 -0.660480
1980-01-09 04:00:00+00:00 -0.880505 -0.844319
1980-01-10 04:00:00+00:00 -1.702018 -1.359068
1980-01-11 04:00:00+00:00 -1.471196 -1.248766
1980-01-12 04:00:00+00:00 -2.588951 -2.167963
... ... ...
2025-06-18 04:00:00+00:00 1.101648 0.815752
2025-06-19 04:00:00+00:00 1.048938 0.742217
2025-06-20 04:00:00+00:00 0.939890 0.852522
2025-06-21 04:00:00+00:00 0.974423 0.631912
2025-06-22 04:00:00+00:00 0.910811 0.631912
temperature_2m_min apparent_temperature_mean \
date
1980-01-08 04:00:00+00:00 -1.056235 -1.186370
1980-01-09 04:00:00+00:00 -1.056235 -1.065703
1980-01-10 04:00:00+00:00 -1.811791 -1.652677
1980-01-11 04:00:00+00:00 -1.367347 -2.180867
1980-01-12 04:00:00+00:00 -2.522900 -2.526479
... ... ...
2025-06-18 04:00:00+00:00 1.261537 0.002126
2025-06-19 04:00:00+00:00 1.217093 -0.089033
2025-06-20 04:00:00+00:00 1.305981 0.183126
2025-06-21 04:00:00+00:00 1.172650 0.325964
2025-06-22 04:00:00+00:00 0.950428 0.066610
apparent_temperature_max apparent_temperature_min \
date
1980-01-08 04:00:00+00:00 -1.182242 -1.125873
1980-01-09 04:00:00+00:00 -1.112033 -0.831593
1980-01-10 04:00:00+00:00 -1.638058 -1.598376
1980-01-11 04:00:00+00:00 -2.405937 -2.075699
1980-01-12 04:00:00+00:00 -2.723019 -2.418514
... ... ...
2025-06-18 04:00:00+00:00 0.049628 0.102581
2025-06-19 04:00:00+00:00 -0.468265 0.061353
2025-06-20 04:00:00+00:00 0.290325 0.196789
2025-06-21 04:00:00+00:00 0.425916 0.440144
2025-06-22 04:00:00+00:00 -0.111114 -0.165209
wind_speed_10m_max et0_fao_evapotranspiration \
date
1980-01-08 04:00:00+00:00 0.981832 -0.657839
1980-01-09 04:00:00+00:00 0.933995 -0.706751
1980-01-10 04:00:00+00:00 0.746366 -1.635292
1980-01-11 04:00:00+00:00 1.716280 0.183671
1980-01-12 04:00:00+00:00 1.418822 -2.302173
... ... ...
2025-06-18 04:00:00+00:00 1.616393 1.365890
2025-06-19 04:00:00+00:00 2.008112 0.797359
2025-06-20 04:00:00+00:00 1.536482 0.693090
2025-06-21 04:00:00+00:00 1.126042 0.880115
2025-06-22 04:00:00+00:00 2.013563 0.895286
rain_sum dew_point_2m_max ... pressure_msl_max \
date ...
1980-01-08 04:00:00+00:00 -0.121174 -0.434959 ... 1.446654
1980-01-09 04:00:00+00:00 -0.265892 -0.502523 ... 1.758364
1980-01-10 04:00:00+00:00 0.126914 -0.502523 ... 1.602493
1980-01-11 04:00:00+00:00 -0.327914 -1.482186 ... 1.342751
1980-01-12 04:00:00+00:00 0.747134 -1.043028 ... 0.667429
... ... ... ... ...
2025-06-18 04:00:00+00:00 -0.410610 0.718682 ... 1.290816
2025-06-19 04:00:00+00:00 -0.369262 0.955152 ... 1.031074
2025-06-20 04:00:00+00:00 -0.410610 1.056497 ... 1.134977
2025-06-21 04:00:00+00:00 -0.431284 0.786244 ... 1.134977
2025-06-22 04:00:00+00:00 -0.224544 0.853806 ... 0.511590
pressure_msl_min relative_humidity_2m_max \
date
1980-01-08 04:00:00+00:00 1.296111 0.608420
1980-01-09 04:00:00+00:00 1.696717 0.659651
1980-01-10 04:00:00+00:00 1.396262 1.206694
1980-01-11 04:00:00+00:00 1.195990 -0.571764
1980-01-12 04:00:00+00:00 0.545021 0.966278
... ... ...
2025-06-18 04:00:00+00:00 1.746777 -0.294515
2025-06-19 04:00:00+00:00 1.646626 -0.231789
2025-06-20 04:00:00+00:00 1.496414 0.384252
2025-06-21 04:00:00+00:00 0.945627 0.117757
2025-06-22 04:00:00+00:00 0.494960 0.429910
relative_humidity_2m_min \
date
1980-01-08 04:00:00+00:00 -0.313262
1980-01-09 04:00:00+00:00 0.043225
1980-01-10 04:00:00+00:00 -0.188164
1980-01-11 04:00:00+00:00 -1.718097
1980-01-12 04:00:00+00:00 0.801040
... ...
2025-06-18 04:00:00+00:00 -0.450565
2025-06-19 04:00:00+00:00 -0.001594
2025-06-20 04:00:00+00:00 -0.292617
2025-06-21 04:00:00+00:00 -0.039553
2025-06-22 04:00:00+00:00 -0.039553
wet_bulb_temperature_2m_max \
date
1980-01-08 04:00:00+00:00 -0.619877
1980-01-09 04:00:00+00:00 -0.707429
1980-01-10 04:00:00+00:00 -0.923192
1980-01-11 04:00:00+00:00 -1.630508
1980-01-12 04:00:00+00:00 -1.448322
... ...
2025-06-18 04:00:00+00:00 0.809145
2025-06-19 04:00:00+00:00 0.985914
2025-06-20 04:00:00+00:00 1.002287
2025-06-21 04:00:00+00:00 0.717535
2025-06-22 04:00:00+00:00 0.838071
wet_bulb_temperature_2m_min \
date
1980-01-08 04:00:00+00:00 -0.556889
1980-01-09 04:00:00+00:00 -0.436828
1980-01-10 04:00:00+00:00 -1.390226
1980-01-11 04:00:00+00:00 -2.212730
1980-01-12 04:00:00+00:00 -1.245028
... ...
2025-06-18 04:00:00+00:00 0.621683
2025-06-19 04:00:00+00:00 0.932243
2025-06-20 04:00:00+00:00 0.521346
2025-06-21 04:00:00+00:00 0.817216
2025-06-22 04:00:00+00:00 0.678709
vapour_pressure_deficit_max \
date
1980-01-08 04:00:00+00:00 0.067544
1980-01-09 04:00:00+00:00 -0.271330
1980-01-10 04:00:00+00:00 -0.318863
1980-01-11 04:00:00+00:00 0.970315
1980-01-12 04:00:00+00:00 -1.144075
... ...
2025-06-18 04:00:00+00:00 0.637523
2025-06-19 04:00:00+00:00 0.202919
2025-06-20 04:00:00+00:00 0.480632
2025-06-21 04:00:00+00:00 0.194522
2025-06-22 04:00:00+00:00 0.194522
soil_temperature_0_to_7cm_mean LOF \
date
1980-01-08 04:00:00+00:00 -0.691370 0.071735
1980-01-09 04:00:00+00:00 -0.741405 0.071735
1980-01-10 04:00:00+00:00 -0.770006 0.071735
1980-01-11 04:00:00+00:00 -0.798600 0.071735
1980-01-12 04:00:00+00:00 -0.827196 0.071735
... ... ...
2025-06-18 04:00:00+00:00 0.132629 0.071735
2025-06-19 04:00:00+00:00 0.114759 0.071735
2025-06-20 04:00:00+00:00 0.109992 0.071735
2025-06-21 04:00:00+00:00 0.121909 0.071735
2025-06-22 04:00:00+00:00 0.138589 0.071735
HBOS_class
date
1980-01-08 04:00:00+00:00 0
1980-01-09 04:00:00+00:00 0
1980-01-10 04:00:00+00:00 0
1980-01-11 04:00:00+00:00 1
1980-01-12 04:00:00+00:00 0
... ...
2025-06-18 04:00:00+00:00 0
2025-06-19 04:00:00+00:00 0
2025-06-20 04:00:00+00:00 0
2025-06-21 04:00:00+00:00 0
2025-06-22 04:00:00+00:00 0
[16603 rows x 23 columns]
# Count the occurrences of each unique value in 'LOF'
counts = HBOS_data['HBOS_class'].value_counts()
print(counts)
HBOS_class 0 15772 1 831 Name: count, dtype: int64
Classification (Logit) Model Based On HBOS¶
# Create and fit the logistic regression model
log_reg = LogisticRegression(max_iter = 1000)
log_reg.fit(X_train, y_train)
# Make predictions
y_pred = log_reg.predict(X_test)
# Evaluate the model
print(classification_report(y_test, y_pred))
# Check the coefficients of the logistic regression model
coefficients = pd.DataFrame({
'Feature': X_train.columns,
'Coefficient': log_reg.coef_[0]
})
print(coefficients.sort_values(by='Coefficient', ascending=False))
precision recall f1-score support
0 0.96 1.00 0.98 3141
1 0.70 0.19 0.30 180
accuracy 0.95 3321
macro avg 0.83 0.59 0.64 3321
weighted avg 0.94 0.95 0.94 3321
Feature Coefficient
3 apparent_temperature_mean 1.791999
18 wet_bulb_temperature_2m_min 1.206207
6 wind_speed_10m_max 1.141004
13 pressure_msl_max 0.795647
5 apparent_temperature_min 0.392854
20 soil_temperature_0_to_7cm_mean 0.384362
8 rain_sum 0.213222
7 et0_fao_evapotranspiration 0.140622
17 wet_bulb_temperature_2m_max 0.044277
4 apparent_temperature_max -0.137123
21 LOF -0.195141
14 pressure_msl_min -0.243705
19 vapour_pressure_deficit_max -0.274131
9 dew_point_2m_max -0.284780
11 surface_pressure_max -0.338216
15 relative_humidity_2m_max -0.359617
12 surface_pressure_min -0.428306
2 temperature_2m_min -0.505661
0 temperature_2m_mean -0.557331
1 temperature_2m_max -0.915928
16 relative_humidity_2m_min -1.116497
10 dew_point_2m_min -1.441105
INTERPRETATION OF THE RESULTS:
Accuracy = 0.95: 95% of all predictions are correct.
Precision for class 1 (0.70): When the model predicts class 1, it's correct 70% of the time.
Recall for class 1 (0.19): The model only identifies 19% of the actual class 1 cases.
F1-score for class 1 (0.30): Low — suggests poor balance between precision and recall.
High precision but low recall for class 1: Model is conservative in predicting class 1 — when it does predict it, it's often right, but it misses most actual instances (false negatives are high).
This is a common issue in imbalanced classification problems.
Positive coefficients → increase likelihood of class 1
Negative coefficients → decrease likelihood of class 1
Higher mean apparent temperature, wet bulb temp, and wind speed are associated with increased probability of class 1. These might indicate weather stress, influencing your event target.
Lower dew point, humidity, and temperature max/mean decrease the likelihood of class 1. Possibly indicating that cooler, drier conditions are linked with class 0 (non-events).
Attempt to balance the data set:
from imblearn.over_sampling import SMOTE
from sklearn.metrics import classification_report
# Apply SMOTE to oversample the minority class
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
# Train the logistic regression model again
log_reg_balanced = LogisticRegression(max_iter=1000, random_state=42)
log_reg_balanced.fit(X_train_balanced, y_train_balanced)
# Evaluate the model
y_pred_balanced = log_reg_balanced.predict(X_test)
print(classification_report(y_test, y_pred_balanced))
# Check the coefficients of the logistic regression model
coefficients = pd.DataFrame({
'Feature': X_train_balanced.columns,
'Coefficient': log_reg_balanced.coef_[0]
})
print(coefficients.sort_values(by='Coefficient', ascending=False))
precision recall f1-score support
0 0.98 0.83 0.90 3141
1 0.21 0.78 0.33 180
accuracy 0.83 3321
macro avg 0.60 0.80 0.62 3321
weighted avg 0.94 0.83 0.87 3321
Feature Coefficient
3 apparent_temperature_mean 3.077397
13 pressure_msl_max 2.507190
6 wind_speed_10m_max 1.933673
18 wet_bulb_temperature_2m_min 1.754571
5 apparent_temperature_min 1.170021
19 vapour_pressure_deficit_max 0.749079
20 soil_temperature_0_to_7cm_mean 0.559690
8 rain_sum 0.309812
9 dew_point_2m_max 0.279644
7 et0_fao_evapotranspiration 0.084072
17 wet_bulb_temperature_2m_max -0.011749
21 LOF -0.289692
12 surface_pressure_min -0.425241
16 relative_humidity_2m_min -0.514414
4 apparent_temperature_max -0.589974
14 pressure_msl_min -0.759746
15 relative_humidity_2m_max -0.770981
2 temperature_2m_min -0.951287
1 temperature_2m_max -1.346989
11 surface_pressure_max -1.492041
0 temperature_2m_mean -1.499307
10 dew_point_2m_min -2.166019
INTERPRETATION --
Overall Performance: Accuracy: 0.83 — 83% of predictions are correct.
Macro Average F1: 0.62 — average F1 across both classes, treating them equally.
Weighted Average F1: 0.87 — average F1 weighted by class frequency (heavily influenced by class 0).
Recall for class 1 jumped from 0.19 → 0.78 ❗
Precision for class 1 dropped from 0.70 → 0.21
Accuracy dropped from 0.95 → 0.83
CLASS 1 (event/rare class): High recall (0.78): The model now catches most actual class 1 cases (few false negatives).
Low precision (0.21): But many of the predicted class 1s are wrong (many false positives).
F1-score (0.33): Modest — model is better at detecting events, but at the cost of many false alarms.
CLASS 0: High precision (0.98) and decent recall (0.83) — it still performs well, but not as perfectly as before.
ACHIEVEMENT: A recall-optimized model that detects more of the rare class (class 1).
Useful if the application values sensitivity over specificity.
Trade-offs: Higher false positives (lower precision) → might need post-processing or human review on class 1s.
Accuracy fell (because you’re now calling many more samples as class 1), but this is expected in imbalanced problems when optimizing recall.
CONSIDERATIONS: Support Vector Machine (classification) or ensemble methods (Random Forest or XGBoost) generally yield better performance, but at the cost of not have a explicity (analytical) model like the logit case; latter often desired by mathematical "calligraphy" enthusiasts.
Outlier Counts With Recent Data¶
To now observe oultier counters in the months of January and July based on the HBOS and "Prophet" algorithms:
HBOS_data.info()
<class 'pandas.core.frame.DataFrame'> DatetimeIndex: 16603 entries, 1980-01-08 04:00:00+00:00 to 2025-06-22 04:00:00+00:00 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 temperature_2m_mean 16603 non-null float64 1 temperature_2m_max 16603 non-null float64 2 temperature_2m_min 16603 non-null float64 3 apparent_temperature_mean 16603 non-null float64 4 apparent_temperature_max 16603 non-null float64 5 apparent_temperature_min 16603 non-null float64 6 wind_speed_10m_max 16603 non-null float64 7 et0_fao_evapotranspiration 16603 non-null float64 8 rain_sum 16603 non-null float64 9 dew_point_2m_max 16603 non-null float64 10 dew_point_2m_min 16603 non-null float64 11 surface_pressure_max 16603 non-null float64 12 surface_pressure_min 16603 non-null float64 13 pressure_msl_max 16603 non-null float64 14 pressure_msl_min 16603 non-null float64 15 relative_humidity_2m_max 16603 non-null float64 16 relative_humidity_2m_min 16603 non-null float64 17 wet_bulb_temperature_2m_max 16603 non-null float64 18 wet_bulb_temperature_2m_min 16603 non-null float64 19 vapour_pressure_deficit_max 16603 non-null float64 20 soil_temperature_0_to_7cm_mean 16603 non-null float64 21 LOF 16603 non-null float64 22 HBOS_class 16603 non-null int32 dtypes: float64(22), int32(1) memory usage: 3.0 MB
HBOS_data_reset = HBOS_data.reset_index()
print(HBOS_data_reset.columns)
Index(['date', 'temperature_2m_mean', 'temperature_2m_max',
'temperature_2m_min', 'apparent_temperature_mean',
'apparent_temperature_max', 'apparent_temperature_min',
'wind_speed_10m_max', 'et0_fao_evapotranspiration', 'rain_sum',
'dew_point_2m_max', 'dew_point_2m_min', 'surface_pressure_max',
'surface_pressure_min', 'pressure_msl_max', 'pressure_msl_min',
'relative_humidity_2m_max', 'relative_humidity_2m_min',
'wet_bulb_temperature_2m_max', 'wet_bulb_temperature_2m_min',
'vapour_pressure_deficit_max', 'soil_temperature_0_to_7cm_mean', 'LOF',
'HBOS_class'],
dtype='object')
from prophet import Prophet
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Ensure 'date' column is datetime
HBOS_data_reset['date'] = pd.to_datetime(HBOS_data_reset['date'])
# Extract year and month from the 'date' column
HBOS_data_reset['year'] = HBOS_data_reset['date'].dt.year
HBOS_data_reset['month'] = HBOS_data_reset['date'].dt.month
# Filter for HBOS outliers in January and July only
jan_july_HBOS_outliers = HBOS_data_reset[
(HBOS_data_reset['HBOS_class'] == 1) &
(HBOS_data_reset['month'].isin([1, 7]))
]
# Check if there are any rows in the filtered dataset
if jan_july_HBOS_outliers.empty:
print("No outliers found for January and July. Please check the dataset or filtering criteria.")
else:
# Group by year and month to count outliers
outlier_counts = jan_july_HBOS_outliers.groupby(['year', 'month']).size().reset_index(name='outlier_count')
# Create a 'date' column for Prophet (first day of each month)
outlier_counts['date'] = pd.to_datetime(outlier_counts[['year', 'month']].assign(day=1))
# Prepare data for Prophet: 'ds' and 'y'
prophet_data = outlier_counts[['date', 'outlier_count']].rename(columns={'date': 'ds', 'outlier_count': 'y'})
# Check for sufficient data
if prophet_data.dropna().shape[0] < 2:
print("The dataset has less than 2 non-NaN rows. Not enough data for Prophet model.")
else:
# Fit Prophet model
model = Prophet()
model.fit(prophet_data)
# Forecast 24 months into the future
future_dates = model.make_future_dataframe(periods=24, freq='MS') # 'MS' = Month Start
forecast = model.predict(future_dates)
# Clip negative values and round forecasts to nearest integer
forecast['yhat'] = forecast['yhat'].clip(lower=0).round()
forecast['yhat_lower'] = forecast['yhat_lower'].clip(lower=0).round()
forecast['yhat_upper'] = forecast['yhat_upper'].clip(lower=0).round()
# Filter forecast for January and July
forecast_jan_july = forecast[forecast['ds'].dt.month.isin([1, 7])]
# Plot actual and forecasted data
plt.figure(figsize=(12, 6))
plt.plot(prophet_data['ds'], prophet_data['y'], marker='o', linestyle='-', label='Actual Outlier Count')
plt.plot(forecast_jan_july['ds'], forecast_jan_july['yhat'], marker='o', linestyle='--',
color='orange', label='Forecasted Outlier Count')
plt.fill_between(forecast_jan_july['ds'],
forecast_jan_july['yhat_lower'],
forecast_jan_july['yhat_upper'],
color='orange', alpha=0.3, label='Forecast CI')
plt.xlabel('Date')
plt.ylabel('Outlier Count')
plt.title('HBOS Outlier Counts for January & July with 2-Year Forecast')
plt.xticks(rotation=45)
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
23:08:37 - cmdstanpy - INFO - Chain [1] start processing 23:08:37 - cmdstanpy - INFO - Chain [1] done processing
Hourly Meteorological Data¶
Hourly meteorological climate data provides detailed information about weather conditions at specific intervals throughout the day. This type of data is essential for understanding short-term weather patterns, tracking changes in atmospheric conditions, and supporting various applications such as weather forecasting, climate modeling, and environmental research.
Key components of hourly meteorological data typically include:
Temperature: Measures the air temperature at a specific height.
Humidity: Indicates the amount of water vapor in the air.
Pressure: Measures the atmospheric pressure.
Wind speed and direction: Describes the movement of air.
Precipitation: Records the amount and type of precipitation (e.g., rain, snow, hail).
Solar radiation: Measures the amount of solar energy reaching the Earth's surface.
Cloud cover: Indicates the percentage of the sky covered by clouds.
Hourly data is collected from various sources, including:
Weather stations: Ground-based stations equipped with sensors to measure different meteorological parameters.
Satellites: Remote sensing satellites that provide observations of the Earth's atmosphere and surface.
Aircraft: Equipped with instruments to collect data during flights.
Applications of hourly meteorological data:
Weather forecasting: Provides the basis for short-term and local weather forecasts.
Environmental research: Supports studies on air quality, water resources, and ecological processes.
Agriculture: Assists farmers in making decisions about planting, irrigation, and harvesting.
Energy: Helps manage energy demand and supply based on weather conditions.
Transportation: Aids in planning and operations of transportation systems, especially those affected by weather (e.g., aviation, shipping).
Challenges and Considerations:
Data quality: Ensuring the accuracy and reliability of hourly data is crucial for its applications.
Data availability: Not all locations have access to comprehensive hourly data, especially in remote or developing regions.
Data assimilation: Combining data from different sources and using data assimilation techniques to improve the quality and consistency of the data.
The model becomes more complex, requiring more computational power and advanced techniques to manage larger datasets and higher-frequency noise.
Overfitting Risk: High-resolution data might lead to overfitting, especially if the true patterns are smoother or less variable over time.
More Noise: Hourly data can introduce more noise due to short-term fluctuations or measurement errors, which may not be as important for long-term forecasting.
Hourly data provides more detail and can capture short-term fluctuations in weather conditions, such as temperature changes or wind speed variations that occur throughout the day.
If the objective is to make predictions for the next few hours, hourly data is more suitable.
Many meteorological variables (e.g., temperature, humidity) have strong diurnal patterns that can only be captured with high-frequency data. Such types of variables also capture atmospheric physics and chemistry aspects.
With hourly data, you typically have more data points, which can improve model training if the data is clean and well-processed.
Of consequence, use of daily meteorological data will be directed towards supervised learning and ensemble learning methods.
Hourly Meteorological Attributes¶
Variable Description
temperature_2m - Instant °C (°F): Air temperature at 2 meters above ground.
relative_humidity_2m - Instant (%): Relative humidity at 2 meters above ground.
dew_point_2m - Instant °C (°F): Dew point temperature at 2 meters above ground.
apparent_temperature - Instant °C (°F): Apparent temperature is the perceived feels-like temperature combining wind chill factor, relative humidity and solar radiation.
pressure_msl and surface_pressure - Instant (hPa): Atmospheric air pressure reduced to mean sea level (msl), and pressure at surface. Typically pressure on mean sea level is used in meteorology. Surface pressure gets lower with increasing elevation.
rain - Preceding hour sum mm (inch): Only liquid precipitation of the preceding hour including local showers and rain from large scale systems.
shortwave_radiation- Preceding hour mean (W/m²): Shortwave solar radiation as average of the preceding hour. This is equal to the total global horizontal irradiation.
direct_radiation and direct_normal_irradiance - Preceding hour mean (W/m²): Direct solar radiation as average of the preceding hour on the horizontal plane and the normal plane (perpendicular to the sun).
diffuse_radiation - Preceding hour mean (W/m²): Diffuse solar radiation as average of the preceding hour.
direct_normal_irradiance_instant (W/m²): any given instant refers to the amount of solar radiation received per unit area by a surface that is held perpendicular to the sun's rays at that specific moment.
terrestrial_radiation_instant (W/m²): this usually refers to the instantaneous terrestrial (longwave) radiation at the Earth’s surface. It represents the longwave radiation emitted by the Earth’s surface at a specific instant in time.
Terrestrial radiation is part of the Earth’s surface energy balance — it’s the infrared energy emitted by the surface as it cools.
wind_speed_10m and wind_speed_100m - Instant km/h (mph, m/s, knots): Wind speed at 10 or 100 meters above ground. Wind speed on 10 meters is the standard level.
et0_fao_evapotranspiration - Preceding hour sum mm (inch):ET₀ Reference Evapotranspiration of a well watered grass field. Based on FAO-56 Penman-Monteith equations ET₀ is calculated from temperature, wind speed, humidity and solar radiation. Unlimited soil water is assumed. ET₀ is commonly used to estimate the required irrigation for plants.
vapour_pressure_deficit - Instant (kPa): Vapor Pressure Deificit (VPD) in kilopascal (kPa). For high VPD (>1.6), water transpiration of plants increases. For low VPD (<0.4), transpiration decreases.
{soil_temperature_0_to_7cm; soil_temperature_7_to_28cm; soil_temperature_28_to_100cm; soil_temperature_100_to_255cm } - Instant °C (°F):Average temperature of different soil levels below ground.
{soil_moisture_0_to_7cm; soil_moisture_7_to_28cm; soil_moisture_28_to_100cm; soil_moisture_100_to_255cm} - Instant (m³/m³): Average soil water content as volumetric mixing ratio at 0-7, 7-28, 28-100 and 100-255 cm depths.
total_column_integrated_water_vapour - represents the total amount of water vapor in a vertical column of the atmosphere, typically expressed in kg/m² or mm
boundary_layer_height - the depth of the lowest part of the atmosphere that is directly influenced by the Earth's surface. This layer is characterized by turbulence and the exchange of heat, moisture, and momentum between the surface and the atmosphere. Its height varies depending on factors like time of day, season, and surface conditions, but it typically ranges from a few hundred meters to a few kilometers.
import openmeteo_requests
import pandas as pd
import requests_cache
from retry_requests import retry
# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)
# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
"latitude": 16.7425,
"longitude": -62.1874,
"start_date": "2022-01-08",
"end_date": "2025-06-24",
"hourly": ["temperature_2m", "relative_humidity_2m", "dew_point_2m", "apparent_temperature", "rain", "pressure_msl", "surface_pressure", "et0_fao_evapotranspiration", "vapour_pressure_deficit", "wind_speed_10m", "wind_speed_100m", "soil_temperature_0_to_7cm", "soil_temperature_7_to_28cm", "soil_moisture_0_to_7cm", "soil_moisture_7_to_28cm", "boundary_layer_height", "wet_bulb_temperature_2m", "shortwave_radiation_instant", "direct_radiation_instant", "diffuse_radiation_instant", "direct_normal_irradiance_instant", "terrestrial_radiation_instant", "total_column_integrated_water_vapour", "albedo"],
"timezone": "auto"
}
responses = openmeteo.weather_api(url, params=params)
# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()}{response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")
# Process hourly data. The order of variables needs to be the same as requested.
hourly = response.Hourly()
hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
hourly_relative_humidity_2m = hourly.Variables(1).ValuesAsNumpy()
hourly_dew_point_2m = hourly.Variables(2).ValuesAsNumpy()
hourly_apparent_temperature = hourly.Variables(3).ValuesAsNumpy()
hourly_rain = hourly.Variables(4).ValuesAsNumpy()
hourly_pressure_msl = hourly.Variables(5).ValuesAsNumpy()
hourly_surface_pressure = hourly.Variables(6).ValuesAsNumpy()
hourly_et0_fao_evapotranspiration = hourly.Variables(7).ValuesAsNumpy()
hourly_vapour_pressure_deficit = hourly.Variables(8).ValuesAsNumpy()
hourly_wind_speed_10m = hourly.Variables(9).ValuesAsNumpy()
hourly_wind_speed_100m = hourly.Variables(10).ValuesAsNumpy()
hourly_soil_temperature_0_to_7cm = hourly.Variables(11).ValuesAsNumpy()
hourly_soil_temperature_7_to_28cm = hourly.Variables(12).ValuesAsNumpy()
hourly_soil_moisture_0_to_7cm = hourly.Variables(13).ValuesAsNumpy()
hourly_soil_moisture_7_to_28cm = hourly.Variables(14).ValuesAsNumpy()
hourly_boundary_layer_height = hourly.Variables(15).ValuesAsNumpy()
hourly_wet_bulb_temperature_2m = hourly.Variables(16).ValuesAsNumpy()
hourly_shortwave_radiation_instant = hourly.Variables(17).ValuesAsNumpy()
hourly_direct_radiation_instant = hourly.Variables(18).ValuesAsNumpy()
hourly_diffuse_radiation_instant = hourly.Variables(19).ValuesAsNumpy()
hourly_direct_normal_irradiance_instant = hourly.Variables(20).ValuesAsNumpy()
hourly_terrestrial_radiation_instant = hourly.Variables(21).ValuesAsNumpy()
hourly_total_column_integrated_water_vapour = hourly.Variables(22).ValuesAsNumpy()
hourly_cloud_cover_mid = hourly.Variables(24).ValuesAsNumpy()
hourly_data = {"date": pd.date_range(
start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
end = pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
freq = pd.Timedelta(seconds = hourly.Interval()),
inclusive = "left"
)}
hourly_data["temperature_2m"] = hourly_temperature_2m
hourly_data["relative_humidity_2m"] = hourly_relative_humidity_2m
hourly_data["dew_point_2m"] = hourly_dew_point_2m
hourly_data["apparent_temperature"] = hourly_apparent_temperature
hourly_data["rain"] = hourly_rain
hourly_data["pressure_msl"] = hourly_pressure_msl
hourly_data["surface_pressure"] = hourly_surface_pressure
hourly_data["et0_fao_evapotranspiration"] = hourly_et0_fao_evapotranspiration
hourly_data["vapour_pressure_deficit"] = hourly_vapour_pressure_deficit
hourly_data["wind_speed_10m"] = hourly_wind_speed_10m
hourly_data["wind_speed_100m"] = hourly_wind_speed_100m
hourly_data["soil_temperature_0_to_7cm"] = hourly_soil_temperature_0_to_7cm
hourly_data["soil_temperature_7_to_28cm"] = hourly_soil_temperature_7_to_28cm
hourly_data["soil_moisture_0_to_7cm"] = hourly_soil_moisture_0_to_7cm
hourly_data["soil_moisture_7_to_28cm"] = hourly_soil_moisture_7_to_28cm
hourly_data["boundary_layer_height"] = hourly_boundary_layer_height
hourly_data["wet_bulb_temperature_2m"] = hourly_wet_bulb_temperature_2m
hourly_data["shortwave_radiation_instant"] = hourly_shortwave_radiation_instant
hourly_data["direct_radiation_instant"] = hourly_direct_radiation_instant
hourly_data["diffuse_radiation_instant"] = hourly_diffuse_radiation_instant
hourly_data["direct_normal_irradiance_instant"] = hourly_direct_normal_irradiance_instant
hourly_data["terrestrial_radiation_instant"] = hourly_terrestrial_radiation_instant
hourly_data["total_column_integrated_water_vapour"] = hourly_total_column_integrated_water_vapour
hourly_data["cloud_cover_mid"] = hourly_cloud_cover_mid
hourly_dataframe = pd.DataFrame(data = hourly_data)
print(hourly_dataframe)
hourly_dataframe.info()
Coordinates 16.76625633239746°N -62.20843505859375°E
Elevation 309.0 m asl
Timezone b'America/Montserrat'b'GMT-4'
Timezone difference to GMT+0 -14400 s
date temperature_2m relative_humidity_2m \
0 2022-01-08 04:00:00+00:00 23.249001 71.679909
1 2022-01-08 05:00:00+00:00 22.598999 76.695610
2 2022-01-08 06:00:00+00:00 22.348999 76.176575
3 2022-01-08 07:00:00+00:00 21.848999 79.526360
4 2022-01-08 08:00:00+00:00 22.098999 80.060951
... ... ... ...
30331 2025-06-24 23:00:00+00:00 NaN NaN
30332 2025-06-25 00:00:00+00:00 NaN NaN
30333 2025-06-25 01:00:00+00:00 NaN NaN
30334 2025-06-25 02:00:00+00:00 NaN NaN
30335 2025-06-25 03:00:00+00:00 NaN NaN
dew_point_2m apparent_temperature rain pressure_msl \
0 17.848999 21.988255 0.0 1018.500000
1 18.299000 21.672226 0.0 1018.299988
2 17.949001 20.790890 0.0 1017.599976
3 18.148998 20.710756 0.1 1017.500000
4 18.499001 20.978884 0.1 1017.400024
... ... ... ... ...
30331 NaN NaN NaN NaN
30332 NaN NaN NaN NaN
30333 NaN NaN NaN NaN
30334 NaN NaN NaN NaN
30335 NaN NaN NaN NaN
surface_pressure et0_fao_evapotranspiration vapour_pressure_deficit \
0 982.982544 0.094050 0.807554
1 982.713318 0.071562 0.638901
2 982.008240 0.079427 0.643319
3 981.852722 0.061376 0.536290
4 981.785583 0.061776 0.530283
... ... ... ...
30331 NaN NaN NaN
30332 NaN NaN NaN
30333 NaN NaN NaN
30334 NaN NaN NaN
30335 NaN NaN NaN
... soil_moisture_7_to_28cm boundary_layer_height \
0 ... 0.07 805.0
1 ... 0.07 805.0
2 ... 0.07 750.0
3 ... 0.07 795.0
4 ... 0.07 835.0
... ... ... ...
30331 ... NaN NaN
30332 ... NaN NaN
30333 ... NaN NaN
30334 ... NaN NaN
30335 ... NaN NaN
wet_bulb_temperature_2m shortwave_radiation_instant \
0 19.534161 0.0
1 19.589806 0.0
2 19.283928 0.0
3 19.243179 0.0
4 19.552332 0.0
... ... ...
30331 NaN NaN
30332 NaN NaN
30333 NaN NaN
30334 NaN NaN
30335 NaN NaN
direct_radiation_instant diffuse_radiation_instant \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
... ... ...
30331 NaN NaN
30332 NaN NaN
30333 NaN NaN
30334 NaN NaN
30335 NaN NaN
direct_normal_irradiance_instant terrestrial_radiation_instant \
0 0.0 0.0
1 0.0 0.0
2 0.0 0.0
3 0.0 0.0
4 0.0 0.0
... ... ...
30331 NaN 0.0
30332 NaN 0.0
30333 NaN 0.0
30334 NaN 0.0
30335 NaN 0.0
total_column_integrated_water_vapour cloud_cover_mid
0 33.200001 0
1 33.400002 0
2 33.500000 0
3 33.700001 0
4 33.299999 0
... ... ...
30331 NaN 0
30332 NaN 0
30333 NaN 0
30334 NaN 0
30335 NaN 0
[30336 rows x 25 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30336 entries, 0 to 30335
Data columns (total 25 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 30336 non-null datetime64[ns, UTC]
1 temperature_2m 30309 non-null float32
2 relative_humidity_2m 30309 non-null float32
3 dew_point_2m 30309 non-null float32
4 apparent_temperature 30309 non-null float32
5 rain 30309 non-null float32
6 pressure_msl 30309 non-null float32
7 surface_pressure 30309 non-null float32
8 et0_fao_evapotranspiration 30309 non-null float32
9 vapour_pressure_deficit 30309 non-null float32
10 wind_speed_10m 30309 non-null float32
11 wind_speed_100m 30309 non-null float32
12 soil_temperature_0_to_7cm 30309 non-null float32
13 soil_temperature_7_to_28cm 30309 non-null float32
14 soil_moisture_0_to_7cm 30309 non-null float32
15 soil_moisture_7_to_28cm 30309 non-null float32
16 boundary_layer_height 25941 non-null float32
17 wet_bulb_temperature_2m 30309 non-null float32
18 shortwave_radiation_instant 30309 non-null float32
19 direct_radiation_instant 30309 non-null float32
20 diffuse_radiation_instant 30309 non-null float32
21 direct_normal_irradiance_instant 30309 non-null float32
22 terrestrial_radiation_instant 30336 non-null float32
23 total_column_integrated_water_vapour 25941 non-null float32
24 cloud_cover_mid 30336 non-null int64
dtypes: datetime64[ns, UTC](1), float32(23), int64(1)
memory usage: 3.1 MB
hourly_dataframe_clean = hourly_dataframe.dropna()
hourly_dataframe_clean.info()
<class 'pandas.core.frame.DataFrame'> Index: 25941 entries, 0 to 30308 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 25941 non-null datetime64[ns, UTC] 1 temperature_2m 25941 non-null float32 2 relative_humidity_2m 25941 non-null float32 3 dew_point_2m 25941 non-null float32 4 apparent_temperature 25941 non-null float32 5 rain 25941 non-null float32 6 pressure_msl 25941 non-null float32 7 surface_pressure 25941 non-null float32 8 et0_fao_evapotranspiration 25941 non-null float32 9 vapour_pressure_deficit 25941 non-null float32 10 wind_speed_10m 25941 non-null float32 11 wind_speed_100m 25941 non-null float32 12 soil_temperature_0_to_7cm 25941 non-null float32 13 soil_temperature_7_to_28cm 25941 non-null float32 14 soil_moisture_0_to_7cm 25941 non-null float32 15 soil_moisture_7_to_28cm 25941 non-null float32 16 boundary_layer_height 25941 non-null float32 17 wet_bulb_temperature_2m 25941 non-null float32 18 shortwave_radiation_instant 25941 non-null float32 19 direct_radiation_instant 25941 non-null float32 20 diffuse_radiation_instant 25941 non-null float32 21 direct_normal_irradiance_instant 25941 non-null float32 22 terrestrial_radiation_instant 25941 non-null float32 23 total_column_integrated_water_vapour 25941 non-null float32 24 cloud_cover_mid 25941 non-null int64 dtypes: datetime64[ns, UTC](1), float32(23), int64(1) memory usage: 2.9 MB
hourly_dataframe_clean.isna().sum()
date 0 temperature_2m 0 relative_humidity_2m 0 dew_point_2m 0 apparent_temperature 0 rain 0 pressure_msl 0 surface_pressure 0 et0_fao_evapotranspiration 0 vapour_pressure_deficit 0 wind_speed_10m 0 wind_speed_100m 0 soil_temperature_0_to_7cm 0 soil_temperature_7_to_28cm 0 soil_moisture_0_to_7cm 0 soil_moisture_7_to_28cm 0 boundary_layer_height 0 wet_bulb_temperature_2m 0 shortwave_radiation_instant 0 direct_radiation_instant 0 diffuse_radiation_instant 0 direct_normal_irradiance_instant 0 terrestrial_radiation_instant 0 total_column_integrated_water_vapour 0 cloud_cover_mid 0 dtype: int64
Summary Statistics for Hourly data¶
# Drop the first column and determinie summary statistics.
hourly_data_sans_first_col = hourly_dataframe_clean.iloc[:, 1:]
hourly_summary_stats = hourly_data_sans_first_col.describe()
print(hourly_summary_stats)
temperature_2m relative_humidity_2m dew_point_2m \
count 25941.000000 25941.000000 25941.000000
mean 25.154356 74.224205 20.160091
std 1.548129 7.545050 2.046124
min 19.398998 35.666039 10.199000
25% 24.098999 69.562462 18.648998
50% 25.199001 74.784309 20.398998
75% 26.348999 79.880203 21.799000
max 29.598999 94.986580 24.199001
apparent_temperature rain pressure_msl surface_pressure \
count 25941.000000 25941.000000 25941.000000 25941.000000
mean 25.536901 0.098462 1014.861694 979.691528
std 2.717877 0.474637 2.277580 2.160366
min 17.866764 0.000000 1003.400024 968.797668
25% 23.569269 0.000000 1013.500000 978.373596
50% 25.550934 0.000000 1015.000000 979.846619
75% 27.414621 0.100000 1016.400024 981.179199
max 34.207329 18.799999 1021.900024 986.293640
et0_fao_evapotranspiration vapour_pressure_deficit wind_speed_10m \
count 25941.000000 25941.000000 25941.000000
mean 0.207622 0.830922 26.346403
std 0.173408 0.270467 8.088293
min 0.000000 0.144631 0.360000
25% 0.074283 0.632275 21.612743
50% 0.118418 0.807082 26.693459
75% 0.346611 0.981912 31.698402
max 0.731624 2.260390 64.005127
... soil_moisture_7_to_28cm boundary_layer_height \
count ... 25941.000000 25941.000000
mean ... 0.046259 776.942505
std ... 0.053182 201.177765
min ... 0.000000 115.000000
25% ... 0.000000 645.000000
50% ... 0.000000 770.000000
75% ... 0.082000 900.000000
max ... 0.353000 1880.000000
wet_bulb_temperature_2m shortwave_radiation_instant \
count 25941.000000 25941.000000
mean 21.658060 245.739502
std 1.639583 320.305603
min 16.066504 0.000000
25% 20.349979 0.000000
50% 21.830799 0.000000
75% 23.056477 516.396912
max 24.967825 1028.832153
direct_radiation_instant diffuse_radiation_instant \
count 25941.000000 25941.000000
mean 184.713196 61.026295
std 259.038788 79.678406
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 370.946777 113.992241
max 925.033020 449.724121
direct_normal_irradiance_instant terrestrial_radiation_instant \
count 25941.000000 25941.000000
mean 272.184357 401.996582
std 333.275879 488.299744
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 11.177355
75% 613.130493 910.017212
max 1010.743530 1349.767578
total_column_integrated_water_vapour cloud_cover_mid
count 25941.000000 25941.0
mean 39.281063 0.0
std 9.275361 0.0
min 16.200001 0.0
25% 32.200001 0.0
50% 38.700001 0.0
75% 46.000000 0.0
max 72.699997 0.0
[8 rows x 24 columns]
Skew and Kurtosis¶
import scipy.stats as stats
#Skew and kurtosis
skewness_hourly = hourly_data_sans_first_col.skew()
kurtosis_hourly = hourly_data_sans_first_col.kurtosis()
print("Skewness:")
print(skewness_hourly)
print("\nKurtosis:")
print(kurtosis_hourly)
Skewness: temperature_2m -0.196995 relative_humidity_2m -0.509483 dew_point_2m -0.428209 apparent_temperature 0.071160 rain 15.350304 pressure_msl -0.420989 surface_pressure -0.409151 et0_fao_evapotranspiration 0.892362 vapour_pressure_deficit 0.702015 wind_speed_10m -0.264995 wind_speed_100m -0.340911 soil_temperature_0_to_7cm 1.467343 soil_temperature_7_to_28cm 0.589537 soil_moisture_0_to_7cm 2.719525 soil_moisture_7_to_28cm 0.925203 boundary_layer_height 0.420913 wet_bulb_temperature_2m -0.279724 shortwave_radiation_instant 0.892665 direct_radiation_instant 1.098373 diffuse_radiation_instant 1.302339 direct_normal_irradiance_instant 0.658273 terrestrial_radiation_instant 0.686273 total_column_integrated_water_vapour 0.233287 cloud_cover_mid 0.000000 dtype: float64 Kurtosis: temperature_2m -0.447634 relative_humidity_2m 0.118305 dew_point_2m -0.424773 apparent_temperature -0.449123 rain 344.158569 pressure_msl 0.306904 surface_pressure 0.336213 et0_fao_evapotranspiration -0.599063 vapour_pressure_deficit 0.576414 wind_speed_10m 0.280692 wind_speed_100m 0.293887 soil_temperature_0_to_7cm 2.663844 soil_temperature_7_to_28cm 0.020899 soil_moisture_0_to_7cm 10.640102 soil_moisture_7_to_28cm 0.636145 boundary_layer_height 0.949319 wet_bulb_temperature_2m -0.895999 shortwave_radiation_instant -0.774997 direct_radiation_instant -0.281340 diffuse_radiation_instant 1.431475 direct_normal_irradiance_instant -1.255075 terrestrial_radiation_instant -1.190991 total_column_integrated_water_vapour -0.460983 cloud_cover_mid 0.000000 dtype: float64
Histograms For Hourly Data¶
NOTE: Quantile-Quantile plots like for daily data will not be entertained for hourly data, because hourly data is very large in volume. For example, three years will have more instances than 4 decades.
import matplotlib.pyplot as plt
import seaborn as sns
# Get the column names
column_names = hourly_data_sans_first_col.columns
print(column_names)
column_names_list = column_names.tolist()
# Calculating the number of ros and columns for subplots.
num_cols = 3 # 3 columns
num_rows = (len(column_names_list) + num_cols - 1) // num_cols
# Calculating the number of rows
# Creating subplots
fig, axes = plt.subplots(num_rows, num_cols, figsize = (15, 10))
# Flatten if required.
if num_rows > 1:
axes = axes.flatten()
# Plot the histograms
for i, col in enumerate(column_names_list):
sns.histplot(data = hourly_data_sans_first_col[col], ax = axes[i], kde = True)
axes[i].set_title(f'Histogram of {col}')
axes[i].set_xlabel('Value')
axes[i].set_ylabel('Frequency')
axes[i].grid(True)
# Adjust layout
plt.tight_layout()
plt.show()
Index(['temperature_2m', 'relative_humidity_2m', 'dew_point_2m',
'apparent_temperature', 'rain', 'pressure_msl', 'surface_pressure',
'et0_fao_evapotranspiration', 'vapour_pressure_deficit',
'wind_speed_10m', 'wind_speed_100m', 'soil_temperature_0_to_7cm',
'soil_temperature_7_to_28cm', 'soil_moisture_0_to_7cm',
'soil_moisture_7_to_28cm', 'boundary_layer_height',
'wet_bulb_temperature_2m', 'shortwave_radiation_instant',
'direct_radiation_instant', 'diffuse_radiation_instant',
'direct_normal_irradiance_instant', 'terrestrial_radiation_instant',
'total_column_integrated_water_vapour', 'cloud_cover_mid'],
dtype='object')
Correlation Analysis for Hourly Data¶
# Applying pearson correlation to the data set.
import matplotlib.pyplot as plt
import seaborn as sns
pearson_corr_hourly = hourly_dataframe_clean.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (20, 16))
sns.heatmap(pearson_corr_hourly, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation Heatmap for Hourly Data')
plt.savefig('heatmap.pdf', format='pdf')
plt.show()
For Pearson correlation applied prior, the above Pearson correlation heatmap conveys much about association among the variables, and level of linearity. Again, naturally, data pairs don't need to have a linear relationship in general.
For semi-diurnal (12 hour case) will choose the 17th of September 2024, being the day when the moon is most active, from 4 PM to 4 AM September 18th, 2024. Atmospheric tides show periodic behavior. Use Fourier analysis to detect these periodicities, say, a general 12 hour periodic setting:
Short-Term Forecasting With Hourly Data¶
Similar to what was done with the Prophet algorithm for long-term forecasting with daily data, will now be done with hourly data for short-term forecasting.
from prophet import Prophet
# Create a copy of the original DataFrame
df_copy = hourly_dataframe_clean.copy()
# Check the column names (Debug line to verify the data structure)
print(df_copy.columns)
# Define the columns you are interested in
target_columns = ['apparent_temperature', 'temperature_2m',
'relative_humidity_2m', 'dew_point_2m', 'pressure_msl',
'boundary_layer_height', 'et0_fao_evapotranspiration',
'wet_bulb_temperature_2m', 'vapour_pressure_deficit']
# Proceed with forecasting for each selected column in the copied DataFrame
for col in target_columns: # Limit iteration to the chosen columns
# Create a temporary DataFrame with "date" as 'ds' and the target column as 'y'
df_temp = df_copy[['date', col]].rename(columns={'date': 'ds', col: 'y'})
# Remove timezone information from the 'ds' column
df_temp['ds'] = df_temp['ds'].dt.tz_localize(None) # Make datetime naive
# Initialize and fit Prophet model
model = Prophet()
model.fit(df_temp)
# Create future dataframe for forecasting (next 8 hours, for example)
future = model.make_future_dataframe(periods=8, freq='h')
# Generate forecast
forecast = model.predict(future)
# Output the forecast for this column
print(f"Forecast for {col}:")
print(forecast[['ds', 'yhat']].tail(8)) # Only showing the forecast for the next 8 periods
Index(['date', 'temperature_2m', 'relative_humidity_2m', 'dew_point_2m',
'apparent_temperature', 'rain', 'pressure_msl', 'surface_pressure',
'et0_fao_evapotranspiration', 'vapour_pressure_deficit',
'wind_speed_10m', 'wind_speed_100m', 'soil_temperature_0_to_7cm',
'soil_temperature_7_to_28cm', 'soil_moisture_0_to_7cm',
'soil_moisture_7_to_28cm', 'boundary_layer_height',
'wet_bulb_temperature_2m', 'shortwave_radiation_instant',
'direct_radiation_instant', 'diffuse_radiation_instant',
'direct_normal_irradiance_instant', 'terrestrial_radiation_instant',
'total_column_integrated_water_vapour', 'cloud_cover_mid'],
dtype='object')
23:09:22 - cmdstanpy - INFO - Chain [1] start processing 23:09:51 - cmdstanpy - INFO - Chain [1] done processing
Forecast for apparent_temperature:
ds yhat
25941 2025-06-24 01:00:00 23.520416
25942 2025-06-24 02:00:00 23.381653
25943 2025-06-24 03:00:00 23.230667
25944 2025-06-24 04:00:00 23.109773
25945 2025-06-24 05:00:00 23.010959
25946 2025-06-24 06:00:00 22.903244
25947 2025-06-24 07:00:00 22.782459
25948 2025-06-24 08:00:00 22.687054
23:09:57 - cmdstanpy - INFO - Chain [1] start processing 23:10:19 - cmdstanpy - INFO - Chain [1] done processing
Forecast for temperature_2m:
ds yhat
25941 2025-06-24 01:00:00 25.058977
25942 2025-06-24 02:00:00 24.973953
25943 2025-06-24 03:00:00 24.857244
25944 2025-06-24 04:00:00 24.736486
25945 2025-06-24 05:00:00 24.636620
25946 2025-06-24 06:00:00 24.551876
25947 2025-06-24 07:00:00 24.460657
25948 2025-06-24 08:00:00 24.367929
23:10:27 - cmdstanpy - INFO - Chain [1] start processing 23:10:44 - cmdstanpy - INFO - Chain [1] done processing
Forecast for relative_humidity_2m:
ds yhat
25941 2025-06-24 01:00:00 80.488718
25942 2025-06-24 02:00:00 80.966981
25943 2025-06-24 03:00:00 81.477646
25944 2025-06-24 04:00:00 81.913806
25945 2025-06-24 05:00:00 82.209654
25946 2025-06-24 06:00:00 82.436414
25947 2025-06-24 07:00:00 82.712208
25948 2025-06-24 08:00:00 83.010652
23:10:50 - cmdstanpy - INFO - Chain [1] start processing 23:11:06 - cmdstanpy - INFO - Chain [1] done processing
Forecast for dew_point_2m:
ds yhat
25941 2025-06-24 01:00:00 21.372553
25942 2025-06-24 02:00:00 21.390714
25943 2025-06-24 03:00:00 21.390129
25944 2025-06-24 04:00:00 21.371335
25945 2025-06-24 05:00:00 21.339052
25946 2025-06-24 06:00:00 21.301854
25947 2025-06-24 07:00:00 21.268606
25948 2025-06-24 08:00:00 21.243605
23:11:13 - cmdstanpy - INFO - Chain [1] start processing 23:11:44 - cmdstanpy - INFO - Chain [1] done processing
Forecast for pressure_msl:
ds yhat
25941 2025-06-24 01:00:00 1019.488802
25942 2025-06-24 02:00:00 1019.823980
25943 2025-06-24 03:00:00 1019.807950
25944 2025-06-24 04:00:00 1019.455391
25945 2025-06-24 05:00:00 1018.911814
25946 2025-06-24 06:00:00 1018.377475
25947 2025-06-24 07:00:00 1018.016779
25948 2025-06-24 08:00:00 1017.911495
23:11:50 - cmdstanpy - INFO - Chain [1] start processing 23:12:11 - cmdstanpy - INFO - Chain [1] done processing
Forecast for boundary_layer_height:
ds yhat
25941 2025-06-24 01:00:00 926.688918
25942 2025-06-24 02:00:00 926.878159
25943 2025-06-24 03:00:00 926.378244
25944 2025-06-24 04:00:00 923.726626
25945 2025-06-24 05:00:00 917.254922
25946 2025-06-24 06:00:00 907.935746
25947 2025-06-24 07:00:00 899.915152
25948 2025-06-24 08:00:00 897.750750
23:12:18 - cmdstanpy - INFO - Chain [1] start processing 23:12:25 - cmdstanpy - INFO - Chain [1] done processing
Forecast for et0_fao_evapotranspiration:
ds yhat
25941 2025-06-24 01:00:00 0.073359
25942 2025-06-24 02:00:00 0.076604
25943 2025-06-24 03:00:00 0.073166
25944 2025-06-24 04:00:00 0.066872
25945 2025-06-24 05:00:00 0.064847
25946 2025-06-24 06:00:00 0.067964
25947 2025-06-24 07:00:00 0.069845
25948 2025-06-24 08:00:00 0.065331
23:12:33 - cmdstanpy - INFO - Chain [1] start processing 23:12:52 - cmdstanpy - INFO - Chain [1] done processing
Forecast for wet_bulb_temperature_2m:
ds yhat
25941 2025-06-24 01:00:00 22.342584
25942 2025-06-24 02:00:00 22.327333
25943 2025-06-24 03:00:00 22.287810
25944 2025-06-24 04:00:00 22.234396
25945 2025-06-24 05:00:00 22.180021
25946 2025-06-24 06:00:00 22.128784
25947 2025-06-24 07:00:00 22.078052
25948 2025-06-24 08:00:00 22.030733
23:12:58 - cmdstanpy - INFO - Chain [1] start processing 23:13:08 - cmdstanpy - INFO - Chain [1] done processing
Forecast for vapour_pressure_deficit:
ds yhat
25941 2025-06-24 01:00:00 0.605932
25942 2025-06-24 02:00:00 0.587156
25943 2025-06-24 03:00:00 0.565609
25944 2025-06-24 04:00:00 0.546271
25945 2025-06-24 05:00:00 0.532887
25946 2025-06-24 06:00:00 0.523015
25947 2025-06-24 07:00:00 0.511349
25948 2025-06-24 08:00:00 0.498137
Apparent Temperature: The Temperature You Feel, Not Just the Temperature on the Thermometer¶
Apparent temperature, often referred to as the "feels like" temperature, is a measure of how hot or cold it feels outside, taking into account factors beyond just the air temperature. These factors include humidity and wind speed, which significantly impact our body's ability to regulate temperature.
When the air is humid, sweat, our body's natural cooling mechanism, evaporates less efficiently. This makes it harder for our bodies to cool down, leading to a higher perceived temperature. Conversely, when the air is dry, sweat evaporates more readily, making us feel cooler.
Wind chill, on the other hand, is the effect of wind on the perceived temperature when it's cold. As wind speeds increase, it accelerates heat loss from our bodies, making us feel colder than the actual air temperature.
Feature Selection for Apparent Temperature with Hourly Meteorological Data¶
Now, there's interest in developing predictive models involving possible relationships between atmospheric physical/chemical attributes instead of predictions based on chronologically sequenced of the atmospheric physical/chemical attributes. In similar fashion to feature selection for daily meteorological data, feature selection will also be done for hourly meteorological data.
For the hourly data set will focus on 'apparent_temperature' due to:
- Computational expense and time dealing with hourly data over multiple years.
- The 'apparent_temperature' target will be applied for later on development.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
# Assuming your DataFrame is named 'hourly_data_sans_first_col'
# Define features and target variable
X = hourly_data_sans_first_col.drop(columns=['apparent_temperature']) # Features
y = hourly_data_sans_first_col['apparent_temperature'] # Target
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize Random Forest Regressor
rf_model = RandomForestRegressor(n_estimators=50, random_state=42)
# Fit the model
rf_model.fit(X_train, y_train)
# Get feature importances
importances = rf_model.feature_importances_
# Create a DataFrame for feature importances
feature_importances = pd.DataFrame({
'Feature': X.columns,
'Importance': importances
}).sort_values(by='Importance', ascending=False)
# Plot feature importances
plt.figure(figsize=(12, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.title('Feature Importances from Random Forest')
plt.gca().invert_yaxis() # Invert y-axis to have the most important feature on top
plt.show()
# Print ranked features based on importance
print("Ranked Features based on Importance:")
print(feature_importances)
# Recursive Feature Elimination (RFE)
rfe = RFE(estimator=rf_model, n_features_to_select=5) # Select top 5 features
rfe.fit(X_train, y_train)
# Get selected features
selected_features = X.columns[rfe.support_]
print("Selected Features by RFE:")
print(selected_features)
Ranked Features based on Importance:
Feature Importance
0 temperature_2m 0.593167
9 wind_speed_100m 0.165724
15 wet_bulb_temperature_2m 0.162016
8 wind_speed_10m 0.031364
6 et0_fao_evapotranspiration 0.019404
14 boundary_layer_height 0.011443
16 shortwave_radiation_instant 0.006382
17 direct_radiation_instant 0.003906
2 dew_point_2m 0.001083
20 terrestrial_radiation_instant 0.000916
10 soil_temperature_0_to_7cm 0.000650
11 soil_temperature_7_to_28cm 0.000600
21 total_column_integrated_water_vapour 0.000589
1 relative_humidity_2m 0.000575
7 vapour_pressure_deficit 0.000501
19 direct_normal_irradiance_instant 0.000311
5 surface_pressure 0.000300
4 pressure_msl 0.000265
18 diffuse_radiation_instant 0.000255
13 soil_moisture_7_to_28cm 0.000237
12 soil_moisture_0_to_7cm 0.000210
3 rain 0.000100
22 cloud_cover_mid 0.000000
Selected Features by RFE:
Index(['temperature_2m', 'et0_fao_evapotranspiration', 'wind_speed_10m',
'wind_speed_100m', 'wet_bulb_temperature_2m'],
dtype='object')
# Applying pearson correlation to the data set.
appar_hourly = hourly_data_sans_first_col[['temperature_2m', 'wind_speed_100m', 'wet_bulb_temperature_2m',
'wind_speed_10m', 'et0_fao_evapotranspiration',
'boundary_layer_height', 'direct_radiation_instant']]
appar_corr = appar_hourly.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(appar_corr, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation Heatmap for Apparent Temperature')
plt.savefig('heatmap.pdf', format='pdf')
plt.show()
Based on the importance/rank of the features, along with the correlation heatmap in consideration, some features will be dropped to possibly treat multicollinearity issues.
To now examine the quality of the resulting quantile regression model.
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Assuming your DataFrame is named 'hourly_data_sans_first_col'
# Define features and target variable
X = hourly_data_sans_first_col[['temperature_2m', 'wind_speed_100m',
'wet_bulb_temperature_2m',
'et0_fao_evapotranspiration', 'boundary_layer_height']]
y = hourly_data_sans_first_col[['apparent_temperature']]
# Add a constant to the model (intercept)
X = sm.add_constant(X)
# Fit the quantile regression model for the 0.5 quantile (median)
quantiles = [0.25, 0.5, 0.75] # Define quantiles of interest
models = {}
for q in quantiles:
model = sm.QuantReg(y, X)
results = model.fit(q=q)
models[q] = results
print(f"Quantile Regression Results for q={q}:")
print(results.summary())
print("\n")
Quantile Regression Results for q=0.25:
QuantReg Regression Results
================================================================================
Dep. Variable: apparent_temperature Pseudo R-squared: 0.8877
Model: QuantReg Bandwidth: 0.05118
Method: Least Squares Sparsity: 0.6802
Date: Fri, 27 Jun 2025 No. Observations: 25941
Time: 23:19:06 Df Residuals: 25935
Df Model: 5
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -6.1065 0.032 -192.606 0.000 -6.169 -6.044
temperature_2m 0.7347 0.003 293.507 0.000 0.730 0.740
wind_speed_100m -0.1276 0.000 -490.656 0.000 -0.128 -0.127
wet_bulb_temperature_2m 0.7634 0.002 315.794 0.000 0.759 0.768
et0_fao_evapotranspiration 1.3379 0.011 120.982 0.000 1.316 1.360
boundary_layer_height -0.0001 1.49e-05 -7.564 0.000 -0.000 -8.34e-05
==============================================================================================
The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Quantile Regression Results for q=0.5:
QuantReg Regression Results
================================================================================
Dep. Variable: apparent_temperature Pseudo R-squared: 0.8800
Model: QuantReg Bandwidth: 0.05505
Method: Least Squares Sparsity: 0.8144
Date: Fri, 27 Jun 2025 No. Observations: 25941
Time: 23:19:07 Df Residuals: 25935
Df Model: 5
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -4.6365 0.047 -97.663 0.000 -4.730 -4.543
temperature_2m 0.6652 0.004 181.683 0.000 0.658 0.672
wind_speed_100m -0.1294 0.000 -356.885 0.000 -0.130 -0.129
wet_bulb_temperature_2m 0.7780 0.004 219.616 0.000 0.771 0.785
et0_fao_evapotranspiration 2.2097 0.018 120.089 0.000 2.174 2.246
boundary_layer_height 3.946e-05 2.05e-05 1.926 0.054 -6.97e-07 7.96e-05
==============================================================================================
The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Quantile Regression Results for q=0.75:
QuantReg Regression Results
================================================================================
Dep. Variable: apparent_temperature Pseudo R-squared: 0.8850
Model: QuantReg Bandwidth: 0.05190
Method: Least Squares Sparsity: 0.7975
Date: Fri, 27 Jun 2025 No. Observations: 25941
Time: 23:19:08 Df Residuals: 25935
Df Model: 5
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -4.6079 0.038 -122.370 0.000 -4.682 -4.534
temperature_2m 0.6597 0.003 207.722 0.000 0.654 0.666
wind_speed_100m -0.1265 0.000 -413.064 0.000 -0.127 -0.126
wet_bulb_temperature_2m 0.7859 0.003 251.546 0.000 0.780 0.792
et0_fao_evapotranspiration 2.8413 0.018 157.897 0.000 2.806 2.877
boundary_layer_height -5.767e-06 1.68e-05 -0.343 0.731 -3.87e-05 2.72e-05
==============================================================================================
The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
As for any remaining multicollinearity issue(s), the reasonable "pruning" would be dropping the 'wet_bulb_temperature_2m' feature.
import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
# Assuming your DataFrame is named 'hourly_data_sans_first_col'
# Define features and target variable
X = hourly_data_sans_first_col[['temperature_2m', 'wind_speed_100m',
'et0_fao_evapotranspiration', 'boundary_layer_height']]
y = hourly_data_sans_first_col[['apparent_temperature']]
# Add a constant to the model (intercept)
X = sm.add_constant(X)
# Fit the quantile regression model for the 0.5 quantile (median)
quantiles = [0.25, 0.5, 0.75] # Define quantiles of interest
models = {}
for q in quantiles:
model = sm.QuantReg(y, X)
results = model.fit(q=q)
models[q] = results
print(f"Quantile Regression Results for q={q}:")
print(results.summary())
print("\n")
Quantile Regression Results for q=0.25:
QuantReg Regression Results
================================================================================
Dep. Variable: apparent_temperature Pseudo R-squared: 0.7829
Model: QuantReg Bandwidth: 0.08148
Method: Least Squares Sparsity: 1.595
Date: Fri, 27 Jun 2025 No. Observations: 25941
Time: 23:19:08 Df Residuals: 25936
Df Model: 4
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -4.9731 0.076 -65.062 0.000 -5.123 -4.823
temperature_2m 1.4022 0.003 473.345 0.000 1.396 1.408
wind_speed_100m -0.0885 0.001 -166.070 0.000 -0.090 -0.087
et0_fao_evapotranspiration -0.6781 0.024 -27.887 0.000 -0.726 -0.630
boundary_layer_height -0.0031 2.47e-05 -124.506 0.000 -0.003 -0.003
==============================================================================================
The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Quantile Regression Results for q=0.5:
QuantReg Regression Results
================================================================================
Dep. Variable: apparent_temperature Pseudo R-squared: 0.7756
Model: QuantReg Bandwidth: 0.09265
Method: Least Squares Sparsity: 1.374
Date: Fri, 27 Jun 2025 No. Observations: 25941
Time: 23:19:08 Df Residuals: 25936
Df Model: 4
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -4.4653 0.080 -55.978 0.000 -4.622 -4.309
temperature_2m 1.3925 0.003 450.059 0.000 1.386 1.399
wind_speed_100m -0.0925 0.001 -168.418 0.000 -0.094 -0.091
et0_fao_evapotranspiration 0.2445 0.028 8.848 0.000 0.190 0.299
boundary_layer_height -0.0030 2.62e-05 -114.357 0.000 -0.003 -0.003
==============================================================================================
The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Quantile Regression Results for q=0.75:
QuantReg Regression Results
================================================================================
Dep. Variable: apparent_temperature Pseudo R-squared: 0.7652
Model: QuantReg Bandwidth: 0.08407
Method: Least Squares Sparsity: 1.985
Date: Fri, 27 Jun 2025 No. Observations: 25941
Time: 23:19:09 Df Residuals: 25936
Df Model: 4
==============================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------------
const -4.0197 0.102 -39.508 0.000 -4.219 -3.820
temperature_2m 1.3826 0.004 352.417 0.000 1.375 1.390
wind_speed_100m -0.0961 0.001 -133.772 0.000 -0.097 -0.095
et0_fao_evapotranspiration 1.2045 0.039 31.252 0.000 1.129 1.280
boundary_layer_height -0.0028 3.38e-05 -82.909 0.000 -0.003 -0.003
==============================================================================================
The condition number is large, 1.52e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
Observed is a drastic drop in Psuedo R-squared, hence, the 'wet_bulb_temperature_2m' feature should be retained. Yet, observing the remiaing "high" value correlation pairs, the multicollinearity concern can be ignored.
NOTE: the above summary statistics only concerns the applied data set. Say, time orientation and place of interest.
The Heat Index: A Measure of Perceived Temperature¶
The heat index is a measure of how hot it feels outside when the relative humidity is factored in with the air temperature. It combines the effects of both temperature and humidity to provide a more accurate representation of the perceived temperature, especially in hot and humid environments.
When the air temperature rises and the relative humidity increases, the human body's ability to cool itself through perspiration becomes less efficient. This is because the moisture in the air slows down the evaporation process, which is essential for cooling. As a result, the body feels hotter than the actual air temperature.
The heat index is calculated using a mathematical formula that takes into account both temperature and relative humidity. It is expressed in degrees Fahrenheit or Celsius, and it provides a more accurate indication of how hot it feels to a person. A high heat index can pose health risks, especially for vulnerable populations such as the elderly, young children, and those with certain medical conditions.
Understanding the heat index is important for individuals and communities to take appropriate precautions during hot weather. By being aware of the heat index, people can stay hydrated, avoid strenuous activities during peak heat hours, and take steps to protect themselves from heat-related illnesses.
From Anderson et al 2013, the (American) National Weather Service (NWS) uses its own complex algorithm for forecasts and heat warnings, and has created a website that calculates heat index using this algorithm, although only for one heat index value at a time (NWS 2011). It's algorithm:
import math
# Function to calculate the Heat Index (HI)
def calculate_heat_index(T, H):
# Step 1: Check if temperature is less than or equal to 40°F
if T <= 40:
return T
# Step 2: Calculate A
A = -10.3 + 1.1 * T + 0.047 * H
# Step 3: Check if A is less than 79°F
if A < 79:
return A
# Step 4: Calculate B using the full formula
B = (-42.379 + 2.04901523 * T + 10.14333127 * H
- 0.22475541 * T * H - 6.83783 * 10**(-3) * T**2
- 5.481717 * 10**(-2) * H**2 + 1.22874 * 10**(-3) * T**2 * H
+ 8.5282 * 10**(-4) * T * H**2 - 1.99 * 10**(-6) * T**2 * H**2)
# Step 5: Check specific conditions for further adjustments
if H <= 13 and 80 <= T <= 112:
B -= ((13 - H) / 4) * math.sqrt((17 - abs(T - 95)) / 17)
return B
if H > 85 and 80 <= T <= 87:
B += 0.02 * (H - 85) * (87 - T)
return B
# Step 6: Default case
return B
# Example usage
T = 90 # Example temperature in °F
H = 70 # Example relative humidity in %
heat_index = calculate_heat_index(T, H)
print(f"The Heat Index is: {round(heat_index, 2)} °F")
The Heat Index is: 105.92 °F
Now to covert the above algorithm to the celsius measure:
# Function to calculate the Heat Index (HI) with temperature in Celsius
def calculate_heat_index_cel(T_celsius, H):
# Convert temperature from Celsius to Fahrenheit
T = (T_celsius * 9/5) + 32
# Step 1: Check if temperature is less than or equal to 40°F
if T <= 40:
return T_celsius
# Step 2: Calculate A
A = -10.3 + 1.1 * T + 0.047 * H
# Step 3: Check if A is less than 79°F
if A < 79:
return (A - 32) * 5/9 # Convert back to Celsius
# Step 4: Calculate B using the full formula
B = (-42.379 + 2.04901523 * T + 10.14333127 * H
- 0.22475541 * T * H - 6.83783 * 10**(-3) * T**2
- 5.481717 * 10**(-2) * H**2 + 1.22874 * 10**(-3) * T**2 * H
+ 8.5282 * 10**(-4) * T * H**2 - 1.99 * 10**(-6) * T**2 * H**2)
# Step 5: Check specific conditions for further adjustments
if H <= 13 and 80 <= T <= 112:
B -= ((13 - H) / 4) * math.sqrt((17 - abs(T - 95)) / 17)
return (B - 32) * 5/9 # Convert back to Celsius
if H > 85 and 80 <= T <= 87:
B += 0.02 * (H - 85) * (87 - T)
return (B - 32) * 5/9 # Convert back to Celsius
# Step 6: Default case
return (B - 32) * 5/9 # Convert back to Celsius
# Example usage
T_celsius = 32.2 # Example temperature in °C (equivalent to 90°F)
H = 70 # Example relative humidity in %
heat_index_celsius = calculate_heat_index_cel(T_celsius, H)
print(f"The Heat Index is: {round(heat_index_celsius, 2)} °C")
The Heat Index is: 41.0 °C
To now observe visually how the (American) NWS Heat Index compares to Apparent Temperature:
hourly_dataframe_clean.info()
<class 'pandas.core.frame.DataFrame'> Index: 25941 entries, 0 to 30308 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 25941 non-null datetime64[ns, UTC] 1 temperature_2m 25941 non-null float32 2 relative_humidity_2m 25941 non-null float32 3 dew_point_2m 25941 non-null float32 4 apparent_temperature 25941 non-null float32 5 rain 25941 non-null float32 6 pressure_msl 25941 non-null float32 7 surface_pressure 25941 non-null float32 8 et0_fao_evapotranspiration 25941 non-null float32 9 vapour_pressure_deficit 25941 non-null float32 10 wind_speed_10m 25941 non-null float32 11 wind_speed_100m 25941 non-null float32 12 soil_temperature_0_to_7cm 25941 non-null float32 13 soil_temperature_7_to_28cm 25941 non-null float32 14 soil_moisture_0_to_7cm 25941 non-null float32 15 soil_moisture_7_to_28cm 25941 non-null float32 16 boundary_layer_height 25941 non-null float32 17 wet_bulb_temperature_2m 25941 non-null float32 18 shortwave_radiation_instant 25941 non-null float32 19 direct_radiation_instant 25941 non-null float32 20 diffuse_radiation_instant 25941 non-null float32 21 direct_normal_irradiance_instant 25941 non-null float32 22 terrestrial_radiation_instant 25941 non-null float32 23 total_column_integrated_water_vapour 25941 non-null float32 24 cloud_cover_mid 25941 non-null int64 dtypes: datetime64[ns, UTC](1), float32(23), int64(1) memory usage: 2.9 MB
# If you are working with a filtered DataFrame, make a copy to avoid SettingWithCopyWarning
hi_hourly_meteo_data = hourly_dataframe_clean.copy()
# Apply the calculate_heat_index function to each row using .loc
hi_hourly_meteo_data.loc[:, 'calculated_heat_index_cel'] = hi_hourly_meteo_data.apply(
lambda row: calculate_heat_index_cel(row['temperature_2m'], row['relative_humidity_2m']),
axis=1
)
# Plot comparison between calculated heat index and apparent temperature
plt.figure(figsize=(10, 6))
plt.plot(hi_hourly_meteo_data['date'], hi_hourly_meteo_data['calculated_heat_index_cel'],
label='Calculated Heat Index', color='blue')
plt.plot(hi_hourly_meteo_data['date'], hi_hourly_meteo_data['apparent_temperature'],
label='Apparent Temperature', color='red', linestyle='--')
# Add titles and labels
plt.title('Comparison of Calculated Heat Index Celsius and Apparent Temperature')
plt.xlabel('Date')
plt.ylabel('Temperature (°C)')
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
# Show the plot
plt.tight_layout()
plt.show()
The above exhibit does convey an overall conformity between the algorithmic model (Heat Index) and the realised data.
To now visually observe the differential concerning Heat Index and Apparent Temperature:
# Making a dataframe copy to avoid SettingWithCopyWarning
hix_hourly_meteo_data_new = hi_hourly_meteo_data.copy()
# Step 1: Calculate the difference between 'calculated_heat_index' and 'apparent_temperature'
# Use .loc to avoid SettingWithCopyWarning
hix_hourly_meteo_data_new.loc[:, 'heat_index_difference'] = hix_hourly_meteo_data_new['calculated_heat_index_cel'] - hix_hourly_meteo_data_new['apparent_temperature']
# Step 2: Plot the difference
plt.figure(figsize=(10, 6))
plt.plot(hix_hourly_meteo_data_new['date'],
hix_hourly_meteo_data_new['heat_index_difference'],
label='Difference (Heat Index - Apparent Temperature)', color='green')
# Add titles and labels
plt.title('Difference Between Calculated Heat Index and Apparent Temperature')
plt.xlabel('Date')
plt.ylabel('Temperature Difference (°C)')
plt.axhline(0, color='black', linestyle='--') # Horizontal line at y=0 for reference
plt.grid(True)
plt.xticks(rotation=45)
plt.legend()
# Show the plot
plt.tight_layout() # Adjust layout to make room for rotated x-axis labels
plt.show()
A 6°C difference in environmental temperature (not body temperature), such as a change in weather, might not be as concerning, though it could still be uncomfortable or require adjustments in clothing or activities.
Comparing the Apparent Temperature Data to the Heat Index Model and a Quantile Regression Model for Apparent Temperature¶
The Heat Index is a widely used metric that combines air temperature and humidity to determine the perceived temperature. Quantile Regression, on the other hand, is a statistical method that models the relationship between variables at different quantiles, allowing for a more nuanced understanding of the relationship between apparent temperature and its influencing factors.
By comparing these models, we can gain insights into their strengths, weaknesses, and applicability in different contexts. The Heat Index Model, while simple and widely used, may have limitations in capturing the full complexity of the relationship between temperature and humidity. Quantile Regression, with its ability to model conditional quantiles, can provide a more detailed understanding of how apparent temperature varies across different percentiles of temperature and humidity.
Furthermore, comparing the models to actual apparent temperature data can help assess their accuracy and identify potential biases. This analysis can inform decision-making in areas such as public health, urban planning, and climate change adaptation, where understanding apparent temperature is crucial.
Overall, this comparison provides valuable insights into the different approaches to modeling apparent temperature and highlights the strengths and limitations of each method. By understanding the nuances of these models, researchers and practitioners can make more informed decisions based on accurate and reliable apparent temperature estimates.
The selected features for apparent_temperature from earlier to be applied.
const -6.6007 0.006 -1143.728 0.000 -6.612 -6.589 temperature_2m 0.7305 0.000 1586.255 0.000 0.730 0.731 wind_speed_100m -0.0007 0.000 -3.469 0.001 -0.001 -0.000 wet_bulb_temperature_2m 0.7992 0.000 1764.586 0.000 0.798 0.800 wind_speed_10m -0.1465 0.000 -591.801 0.000 -0.147 -0.146 et0_fao_evapotranspiration 0.4220 0.002 237.996 0.000 0.419 0.426 boundary_layer_height 4.665e-05 2.49e-06 18.717 0.000 4.18e-05 5.15e-05
from statsmodels.tsa.stattools import coint
# Assuming hix_hourly_meteo_data is loaded as a DataFrame
# Step 1: Feature Engineering the Regression Model
hix_hourly_meteo_data_new['app_heat_predict_mod'] = (
-6.6007
+ 0.7305 * hix_hourly_meteo_data_new['temperature_2m']
- 0.0007 * hix_hourly_meteo_data_new['wind_speed_100m']
+ 0.7992 * hix_hourly_meteo_data_new['wet_bulb_temperature_2m']
+ 0.4220 * hix_hourly_meteo_data_new['et0_fao_evapotranspiration']
+ 4.665e-05 * hix_hourly_meteo_data_new['boundary_layer_height']
)
# Step 2: Plotting the Time Series
plt.figure(figsize=(12, 6))
plt.plot(hix_hourly_meteo_data_new.index,
hix_hourly_meteo_data_new['app_heat_predict_mod'],
label='App Heat Predict Mod')
plt.plot(hix_hourly_meteo_data_new.index,
hix_hourly_meteo_data_new['apparent_temperature'],
label='Apparent Temperature')
plt.plot(hix_hourly_meteo_data_new.index,
hix_hourly_meteo_data_new['calculated_heat_index_cel'],
label='Calculated Heat Index_cel')
plt.title('Time Series Plot')
plt.xlabel('Time')
plt.ylabel('Heat Feel Values')
plt.legend()
plt.show()
# Step 3: Plotting the Differential
hix_hourly_meteo_data_new['diff_predict_apparent'] = hix_hourly_meteo_data_new['app_heat_predict_mod'] - hix_hourly_meteo_data_new['apparent_temperature']
hix_hourly_meteo_data_new['diff_predict_calculated'] = hix_hourly_meteo_data_new['app_heat_predict_mod'] - hix_hourly_meteo_data_new['calculated_heat_index_cel']
hix_hourly_meteo_data_new['diff_apparent_calculated'] = hix_hourly_meteo_data_new['apparent_temperature'] - hix_hourly_meteo_data_new['calculated_heat_index_cel']
plt.figure(figsize=(12, 6))
plt.plot(hix_hourly_meteo_data_new.index,
hix_hourly_meteo_data_new['diff_predict_apparent'],
label='Predicted - Apparent')
plt.plot(hix_hourly_meteo_data_new.index,\
hix_hourly_meteo_data_new['diff_predict_calculated'],
label='Predicted - Calculated')
plt.plot(hix_hourly_meteo_data_new.index,
hix_hourly_meteo_data_new['diff_apparent_calculated'],
label='Apparent - Calculated')
plt.title('Differential Time Series Plot')
plt.xlabel('Time')
plt.ylabel('Differential Values')
plt.legend()
plt.show()
# Performing Cointegration Tests
coint_test_apparent = coint(hix_hourly_meteo_data_new['app_heat_predict_mod'],
hix_hourly_meteo_data_new['apparent_temperature'])
coint_test_calculated = coint(hix_hourly_meteo_data_new['app_heat_predict_mod'],
hix_hourly_meteo_data_new['calculated_heat_index_cel'])
coint_test_apparent_calculated = coint(hix_hourly_meteo_data_new['apparent_temperature'],
hix_hourly_meteo_data_new['calculated_heat_index_cel'])
# Outputting the results
print("Cointegration Test Results:")
print("1. App Heat Predict Mod and Apparent Temperature:")
print(f" - Test Statistic: {coint_test_apparent[0]}")
print(f" - p-value: {coint_test_apparent[1]}")
print(f" - Critical Values: {coint_test_apparent[2]}\n")
print("2. App Heat Predict Mod and Calculated Heat Index:")
print(f" - Test Statistic: {coint_test_calculated[0]}")
print(f" - p-value: {coint_test_calculated[1]}")
print(f" - Critical Values: {coint_test_calculated[2]}\n")
print("3. Apparent Temperature and Calculated Heat Index:")
print(f" - Test Statistic: {coint_test_apparent_calculated[0]}")
print(f" - p-value: {coint_test_apparent_calculated[1]}")
print(f" - Critical Values: {coint_test_apparent_calculated[2]}")
Cointegration Test Results: 1. App Heat Predict Mod and Apparent Temperature: - Test Statistic: -10.07281136448165 - p-value: 1.626711589561262e-16 - Critical Values: [-3.89686225 -3.33636556 -3.0446135 ] 2. App Heat Predict Mod and Calculated Heat Index: - Test Statistic: -9.903578849491844 - p-value: 4.346699413691353e-16 - Critical Values: [-3.89686225 -3.33636556 -3.0446135 ] 3. Apparent Temperature and Calculated Heat Index: - Test Statistic: -12.119515906460899 - p-value: 2.0804774298814206e-21 - Critical Values: [-3.89686225 -3.33636556 -3.0446135 ]
Based on the three compared raw times series, the three compared differentials time series, and the cointegration results, the quantile regression model is closer to the apparent_temperature attribute than the heat index. Additionally, for high temperatures the heat index may serve as a better gauge for extreme feel sensation; the quantile regression model for cold temperatures.
Observing Hurricanes: A Blend of Wind Speed and Pressure¶
Hurricanes, nature's most destructive forces, are categorized based on their sustained wind speeds and central atmospheric pressure. These two primary parameters provide a reliable measure of a hurricane's intensity and potential for damage.
The Saffir-Simpson Hurricane Wind Scale
The Saffir-Simpson Hurricane Wind Scale is a widely used classification system that categorizes hurricanes into five categories based on their sustained wind speeds. The higher the category, the more destructive the hurricane.
Category 1: 74-95 mph (119-153 km/h); 64-82 kt
Category 2: 96-110 mph (154-177 km/h); 83-95 kt
Category 3: 111-129 mph (178-209 km/h); 96-112 kt
Category 4: 130-156 mph (209-251 km/h); 113-136 kt
Category 5: 157 mph or higher (252 km/h or higher); 137 kt or higher
Atmospheric Pressure - A Silent Indicator
While wind speed is a visible and often dramatic indicator of a hurricane's strength, atmospheric pressure is a less obvious but equally crucial factor. As a hurricane intensifies, its central atmospheric pressure decreases. Lower pressure indicates a stronger storm, as it signifies a more powerful low-pressure system.
Visualizing Hurricane Tracks with Python and Folium Python, a versatile programming language, offers powerful libraries like Folium for creating interactive maps. By combining data on hurricane tracks, wind speeds, and atmospheric pressure, we can visualize the evolution of these storms over time.
Historical hurricane tracks data is acquired from the Climate Mapping for Resilience and Adaptation (CMRA) resource to develop geospatial projects.
VARIABLES IN THE DATA SET:
SID = Storm Identifier
BASIN = Basin (type or category)
SUBBASIN = Subbaisin (type or category)
NAME = Name (name or title)
LAT = Latitude (coordinate)
LON = Longitude (coordinate)
USA_WIND = Maximum Sustained Wind Speed (knots) 0 - 300 kts
USA_PRES = Minimum Sea Level Pressure (millibars) 850 - 1050 mb
year = Year (integer)
month = Month (integer)
day = Day (integer)
Hurricane_Date = Date (preferably to be in datetime format)
NOTE: to keep interest in visuals there will be a limited number of graphs constructed; the project is generally focused on the NYC area.
Data assimilation and cleaning:
import pandas as pd
import os
print(os.getcwd())
hurricane_data = pd.read_csv(r"C:\Users\verlene\Downloads\Historical_Hurricane_Tracks (1).csv")
hurricane_data.info()
C:\Users\verlene <class 'pandas.core.frame.DataFrame'> RangeIndex: 696780 entries, 0 to 696779 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OBJECTID 696780 non-null int64 1 SID 696780 non-null object 2 BASIN 574121 non-null object 3 SUBBASIN 603264 non-null object 4 NAME 696780 non-null object 5 LAT 696780 non-null float64 6 LON 696780 non-null float64 7 USA_WIND 696780 non-null int64 8 USA_PRES 696780 non-null int64 9 year 696780 non-null int64 10 month 696780 non-null int64 11 day 696780 non-null int64 12 Hurricane_Date 696780 non-null object dtypes: float64(2), int64(6), object(5) memory usage: 69.1+ MB
#Checking for all unique instances
unique_values = hurricane_data['NAME'].unique()
unique_values
array(['NOT_NAMED', 'ANN', 'BETTY', ..., 'YAMANEKO', 'MANDOUG', 'DARIAN'],
dtype=object)
# Dropping null entries
hurricane_data = hurricane_data.dropna()
hurricane_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 574121 entries, 0 to 696779 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OBJECTID 574121 non-null int64 1 SID 574121 non-null object 2 BASIN 574121 non-null object 3 SUBBASIN 574121 non-null object 4 NAME 574121 non-null object 5 LAT 574121 non-null float64 6 LON 574121 non-null float64 7 USA_WIND 574121 non-null int64 8 USA_PRES 574121 non-null int64 9 year 574121 non-null int64 10 month 574121 non-null int64 11 day 574121 non-null int64 12 Hurricane_Date 574121 non-null object dtypes: float64(2), int64(6), object(5) memory usage: 61.3+ MB
Concerns for hurricanes from 1989 to recent.
modern_hurricanes_tracks = hurricane_data[hurricane_data['year'] >= 1989]
modern_hurricanes_tracks
| OBJECTID | SID | BASIN | SUBBASIN | NAME | LAT | LON | USA_WIND | USA_PRES | year | month | day | Hurricane_Date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 474096 | 474097 | 1988364S17148 | SP | EA | DELILAH | -17.82 | 155.96 | 35 | 0 | 1989 | 1 | 1 | 1989/01/01 05:00:00+00 |
| 474097 | 474098 | 1988364S17148 | SP | EA | DELILAH | -17.93 | 156.79 | 40 | 0 | 1989 | 1 | 1 | 1989/01/01 05:00:00+00 |
| 474098 | 474099 | 1988364S17148 | SP | EA | DELILAH | -18.07 | 157.63 | 45 | 0 | 1989 | 1 | 1 | 1989/01/01 05:00:00+00 |
| 474099 | 474100 | 1988364S17148 | SP | EA | DELILAH | -18.19 | 158.54 | 45 | 0 | 1989 | 1 | 1 | 1989/01/01 05:00:00+00 |
| 474100 | 474101 | 1988364S17148 | SP | EA | DELILAH | -18.33 | 159.48 | 45 | 0 | 1989 | 1 | 1 | 1989/01/01 05:00:00+00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 696775 | 696776 | 2022352S12093 | SI | MM | DARIAN | -29.77 | 68.25 | 42 | 1000 | 2022 | 12 | 30 | 2022/12/30 05:00:00+00 |
| 696776 | 696777 | 2022352S12093 | SI | MM | DARIAN | -30.40 | 68.20 | 39 | 1001 | 2022 | 12 | 31 | 2022/12/31 05:00:00+00 |
| 696777 | 696778 | 2022352S12093 | SI | MM | DARIAN | -30.99 | 68.19 | 0 | 0 | 2022 | 12 | 31 | 2022/12/31 05:00:00+00 |
| 696778 | 696779 | 2022357S13130 | SI | WA | ELLIE | -13.30 | 129.80 | 39 | 994 | 2022 | 12 | 22 | 2022/12/22 05:00:00+00 |
| 696779 | 696780 | 2022357S13130 | SI | WA | ELLIE | -13.75 | 129.95 | 37 | 995 | 2022 | 12 | 22 | 2022/12/22 05:00:00+00 |
191038 rows × 13 columns
# Drop duplicates based on 'NAME' and 'Hurricane_Date', keeping the first occurrence
modern_hurricanes_track_unique = modern_hurricanes_tracks.drop_duplicates(subset=['NAME',
'Hurricane_Date'],
keep='first')
# Display the result
print(modern_hurricanes_track_unique)
OBJECTID SID BASIN SUBBASIN NAME LAT LON \
474096 474097 1988364S17148 SP EA DELILAH -17.82 155.96
474104 474105 1988364S17148 SP MM DELILAH -19.40 163.15
474112 474113 1988364S17148 SP MM DELILAH -23.25 168.72
474120 474121 1988364S17148 SP MM DELILAH -28.10 170.80
474128 474129 1988364S17148 SP MM DELILAH -32.10 170.50
... ... ... ... ... ... ... ...
696752 696753 2022352S12093 SI MM DARIAN -19.50 79.20
696760 696761 2022352S12093 SI MM DARIAN -22.40 73.70
696768 696769 2022352S12093 SI MM DARIAN -26.10 70.20
696776 696777 2022352S12093 SI MM DARIAN -30.40 68.20
696778 696779 2022357S13130 SI WA ELLIE -13.30 129.80
USA_WIND USA_PRES year month day Hurricane_Date
474096 35 0 1989 1 1 1989/01/01 05:00:00+00
474104 55 0 1989 1 2 1989/01/02 05:00:00+00
474112 55 0 1989 1 3 1989/01/03 05:00:00+00
474120 45 0 1989 1 4 1989/01/04 05:00:00+00
474128 0 0 1989 1 5 1989/01/05 05:00:00+00
... ... ... ... ... ... ...
696752 60 989 2022 12 28 2022/12/28 05:00:00+00
696760 54 994 2022 12 29 2022/12/29 05:00:00+00
696768 45 1001 2022 12 30 2022/12/30 05:00:00+00
696776 39 1001 2022 12 31 2022/12/31 05:00:00+00
696778 39 994 2022 12 22 2022/12/22 05:00:00+00
[25085 rows x 13 columns]
modern_hurricanes_track_unique.info()
<class 'pandas.core.frame.DataFrame'> Index: 25085 entries, 474096 to 696778 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OBJECTID 25085 non-null int64 1 SID 25085 non-null object 2 BASIN 25085 non-null object 3 SUBBASIN 25085 non-null object 4 NAME 25085 non-null object 5 LAT 25085 non-null float64 6 LON 25085 non-null float64 7 USA_WIND 25085 non-null int64 8 USA_PRES 25085 non-null int64 9 year 25085 non-null int64 10 month 25085 non-null int64 11 day 25085 non-null int64 12 Hurricane_Date 25085 non-null object dtypes: float64(2), int64(6), object(5) memory usage: 2.7+ MB
K-Means Clustering: A Common Tool for Data Analysis¶
K-Means clustering is a common technique in the realm of unsupervised machine learning, a branch of artificial intelligence that delves into unlabeled data. This powerful algorithm is designed to group similar data points together, making it a versatile tool for a wide range of applications.
At its core, K-Means operates through an iterative process:
Initialization: The algorithm begins by randomly selecting K data points as initial centroids, which serve as the starting points for each cluster.
Assignment: Each data point is assigned to the nearest centroid, forming K distinct clusters.
Update Centroids: The centroids of each cluster are recalculated as the mean of all the points assigned to that cluster.
Iteration: Steps 2 and 3 are repeated until convergence, meaning the centroids no longer shift significantly.
While K-Means is a relatively simple algorithm, its effectiveness hinges on careful consideration of several factors:
Choosing the Right K: Determining the optimal number of clusters (K) is a critical decision. Techniques like the Elbow Method and Silhouette Analysis can aid in this process.
Initialization Sensitivity: The initial random selection of centroids can influence the final clustering results. K-Means++ is a popular technique to mitigate this issue.
Outliers: Outliers can distort the clustering process. Robust K-Means algorithms and outlier detection techniques can help address this challenge.
Scalability: For large datasets, K-Means can become computationally expensive. Mini-Batch K-Means is a scalable alternative that processes data in smaller batches.
Mathematical Structure¶
Given a dataset of $n$ points $X = {x_1,x_2,..,x_n}$ in $\mathbb{R}^d$, the goal is to partition $X$ into $K$ clusters such that each data point is assigned to the nearest cluster center, minimizing the sum of squared distances to the nearest centroid:
1. Centroids Initialization:
Initialize $K$ centroids ${\mu_1,\mu_2,..,\mu_K}$ randoomly from the dataset.
2. Assignment Step:
For each data point $x_i$, assign it to the nearest cluster center:
$$c_{i} = \arg\min_{j \in \{1, 2, \ldots, K\}} \left\| x_{i} - \mu_j \right\|^2$$where $c_i$ is the cluster assignment $x_i$, and $\mu_j$ represents the centroid of cluster $j$.
3. Update Step:
After assigning each point, recompute the centroid of each cluster by taking the mean of all points assigned to it:
$$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$$where $C_j$ represents the set of points assigned to cluster $j$, and $|C_j|$ is the number of points in cluster $j$.
4. Iterate:
Repeat the assignment and update steps until the centroids converge, which is generally achieved when there is little or no change in the positions of the centroids, or a maximum number of iterations is reached.
OBJECTIVE FUNCTION:
KMeans aims to minimize the Within-Cluster Sum of Squares (WCSS):
$$J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2$$A Demonstration of K-Means Clustering¶
import numpy as np import pandas as pd import matplotlib.pyplot as plt from sklearn.cluster import KMeans from sklearn.datasets import make_blobs
Generating synthetic data with make_blobs function:
# Generate synthetic data
from sklearn.datasets import make_blobs
n_samples = 1000
n_clusters = 4
X, y_true = make_blobs(n_samples=n_samples, centers=n_clusters, cluster_std=0.60, random_state=42)
# Convert to DataFrame for easier manipulation
data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2'])
Visualizing the synthetic data before clustering:
# Generate synthetic data
n_samples = 1000
n_clusters = 4
X, y_true = make_blobs(n_samples=n_samples, centers=n_clusters, cluster_std=0.60, random_state=42)
# Convert to DataFrame for easier manipulation
data = pd.DataFrame(X, columns=['Feature_1', 'Feature_2'])
# Check the first few rows of the generated data
print(data.head())
Feature_1 Feature_2 0 -8.668355 7.168180 1 -6.434370 -6.700534 2 -6.544631 -6.834506 3 4.364262 1.463263 4 4.484124 1.071284
Visualizing the synthetic data:
# Visualize the generated data before clustering
plt.figure(figsize=(8, 6))
plt.scatter(data['Feature_1'], data['Feature_2'], s=30, color='blue', marker='o')
plt.title('Generated Synthetic Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()
Fitting a K-Means model to the data and predict the cluster for each data point:
# Apply K-Means
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=n_clusters, random_state=42)
data['Cluster'] = kmeans.fit_predict(X)
Visualizing the clusters:
plt.figure(figsize=(8, 6))
plt.scatter(data['Feature_1'], data['Feature_2'], c=data['Cluster'], s=30, cmap='viridis')
centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='red', s=200, alpha=0.75, marker='X') # Mark the centers
plt.title('K-Means Clustering Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()
Now, back to the cleaned (real) historical hurricanes data:¶
modern_hurricanes_track_unique.info()
<class 'pandas.core.frame.DataFrame'> Index: 25085 entries, 474096 to 696778 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OBJECTID 25085 non-null int64 1 SID 25085 non-null object 2 BASIN 25085 non-null object 3 SUBBASIN 25085 non-null object 4 NAME 25085 non-null object 5 LAT 25085 non-null float64 6 LON 25085 non-null float64 7 USA_WIND 25085 non-null int64 8 USA_PRES 25085 non-null int64 9 year 25085 non-null int64 10 month 25085 non-null int64 11 day 25085 non-null int64 12 Hurricane_Date 25085 non-null object dtypes: float64(2), int64(6), object(5) memory usage: 2.7+ MB
MiniBatch K-Means¶
Mini-Batch K-Means is a variation of K-Means that addresses the scalability issue by using smaller subsets of data, called mini-batches, in each iteration. This approach significantly reduces computational cost, especially for large datasets.
Key Characteristics:
Mini-Batch Processing: Processes smaller subsets of data in each iteration.
Faster Convergence: Often converges faster than K-Means, especially for large datasets.
Approximation: Due to the use of mini-batches, it might not converge to the same solution as K-Means, but it often provides a good approximation.
Scalability: More scalable than K-Means for large datasets.
Overview
Mini-Batch K-Means Usage:
Large datasets where computational efficiency is crucial.
When a good approximation of the optimal clustering is sufficient.
Online learning scenarios where data arrives in a stream.
Mathematical Structure¶
MiniBatch KMeans stirives to achieve clustering by updating centroids with small, randomly selected "mini-batches' of the data rather than the complete data set in each iteration.
1. MiniBatch Selection:
In each iteration a random set (mini-batch) of $m$ data points $B = {x_{i1}, x_{i2},..., x_{im}}$ is sampled from the full dataset $X$, where $m<n$.
2. Assignment Step:
For each point in the mini-batch $x_{ik}$, assign it to the nearest cluster center based on the squared Euclidean ditance:
$$c_{ik} = \arg\min_{j \in \{1, 2, \ldots, K\}} \left\| x_{ik} - \mu_j \right\|^2$$3. Update Step:
For each cluster $j$ represented in the mini-batch, update its centroid $\mu_j$ based on the points assigned to it in the mini-batch. Using incremental mean update:
$$\mu_j = \\mu_j + \eta(x_{ik} - \mu_j)$$where $\eta$ is the learning rate, generally computed as $\frac{1}{t_j}$, with $t_j$ as the numnber of times cluster $j$ has been updated.
4. Repeat:
Repeat the assignment and update steps until converge criteria are met, typically when centroids stabilize or a set number of iterations are reached.
Performing MiniBatch K-Means Clustering on the (real) historical hurricanes data:
The Elbow Method¶
The elbow method is a technique used to determine the optimal number of clusters (k) in K-Means clustering. It evaluates how the sum of squared distances between data points and their corresponding cluster centroids (called within-cluster sum of squares, WCSS) decreases as the number of clusters increases. The goal is to find the "elbow point," where adding more clusters does not significantly reduce the WCSS.
In K-Means clustering, inertia refers to the within-cluster sum of squares (WCSS). It measures how well the clustering algorithm has grouped the data points within their respective clusters. Specifically, it quantifies how close the data points are to their cluster's centroid.
Definition of Inertia:
Inertia is calculated as the sum of the squared distances between each data point and the centroid of the cluster it belongs to. Mathematically:
$$\text{Inertia} = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - \mu_i \|^2$$Where:
$k$ is the number of clusters;
$C_i$ is the set of cluster points in cluster $i$;
$x$ is a data point;
$\mu$ is the centroid of the cluster $i$.
Interpretation
Low inertia means that the points are close to their centroids, indicating good clustering.
High inertia means that the points are farther from their centroids, which may indicate that the clustering is not well-fitted.
Inertia decreases when increasing the number of clusters, because the data points are split into smaller groups. However, adding more clusters reduces the inertia at diminishing returns. This is why there's use of the elbow method to balance the trade-off between inertia reduction and model simplicity.
The elbow method involves plotting the inertia (WCSS) against the number of clusters (k). As the number of clusters increases, inertia will decrease. However, after a certain point (the elbow), the reduction in inertia becomes negligible, indicating the optimal number of clusters.
For the Elbow Method one should observe where the drastic drop in the curve falters.
Silhouette Score¶
The silhouette score is a metric used to evaluate the quality of clusters formed by a clustering algorithm like K-Means. It measures how similar an object is to its own cluster compared to other clusters. The silhouette score can range from -1 to 1, where:
A score close to 1 indicates that the object is well-clustered and is close to its own cluster center while being far away from other clusters.
A score close to 0 indicates that the object is on or very close to the decision boundary between two neighboring clusters.
A score close to -1 indicates that the object may have been assigned to the wrong cluoptimal.
ong clu For a single data point $i$, the silhouette score $s(i)$ is defined as:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$Where:
$a(i)$ is the average intra-cluster distance for point $i$;
$b(i)$ is the average inter-cluster distance for p. $i$.
Explanation of the Code
- Data Loading and Preparation:
The relevant columns are selected from the dataset, and any missing values are dropped.
- Standardization:
Standardizing the data helps K-means perform better by giving each feature equal weight.
- Elbow Method and Silhouette Score:
The Elbow Method is used to determine the optimal number of clusters by plotting inertia (the sum of squared distances from each point to its assigned cluster center).
The Silhouette Score provides insight into how well the clusters are defined.
- K-means Clustering:
After determining the optimal number of clusters, K-means is fitted to the scaled data, and clusters are assigned.
- Visualization:
Clusters are visualized using latitude and longitude for spatial analysis.
- Cluster Analysis:
The mean values for each feature in each cluster are calculated to analyze cluster characteristics.
The outputs observed are the centroids (or means) of clusters generated from a K-Means clustering analysis on a dataset that includes variables related to hurricanes, particularly USA_WIND, USA_PRES, LAT, and LON.
Comprehending the data:
Cluster Index: The first column (the index) labeled Cluster represents the different clusters that K-Means has identified in the dataset. Each number (0, 1, 2, 3) corresponds to a different cluster.
Variables:
USA_WIND: This represents the wind speed associated with the hurricanes (measured in miles per hour or another unit). Higher values indicate stronger winds.
USA_PRES: This indicates the atmospheric pressure (measured in millibars or inches of mercury) associated with the hurricanes. Lower pressure is typically associated with more intense storms.
LAT (Latitude): This indicates the latitude where the hurricanes occurred. Latitude values range from -90 (South Pole) to +90 (North Pole).
LON (Longitude): This indicates the longitude of the hurricane's location, with values ranging from -180 to +180 degrees.
import os
os.environ["OMP_NUM_THREADS"] = "2"
from sklearn.cluster import MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import folium
from folium.plugins import MarkerCluster
data = modern_hurricanes_track_unique[['USA_WIND', 'USA_PRES', 'LAT', 'LON']].dropna()
# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
# Determine the optimal number of clusters using the Elbow Method
inertia = []
silhouette_scores = []
K_range = range(2, 11)
for k in K_range:
mb_kmeans = MiniBatchKMeans(n_clusters=k, random_state=42, batch_size=1024) # Increase batch size
mb_kmeans.fit(scaled_data)
inertia.append(mb_kmeans.inertia_)
silhouette_scores.append(silhouette_score(scaled_data, mb_kmeans.labels_))
# Plot the Elbow Curve
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(K_range, inertia, marker='o')
plt.title('Elbow Method for Mini-Batch K-Means')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
# Plot the Silhouette Scores
plt.subplot(1, 2, 2)
plt.plot(K_range, silhouette_scores, marker='o')
plt.title('Silhouette Scores for Mini-Batch K-Means')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.tight_layout()
plt.show()
# Choose the number of clusters based on the plots
optimal_k = 6
mb_kmeans = MiniBatchKMeans(n_clusters=optimal_k, random_state=42, batch_size=1024) # Increase batch size
data['Cluster'] = mb_kmeans.fit_predict(scaled_data)
# Calculate bounds for each cluster
bounds = data.groupby('Cluster').agg(
wind_speed_min=('USA_WIND', 'min'),
wind_speed_max=('USA_WIND', 'max'),
pressure_min=('USA_PRES', 'min'),
pressure_max=('USA_PRES', 'max')
).reset_index()
# Display the bounds
print(f"Bounds for Each Cluster:")
print(bounds)
# Count occurrences in 'Cluster'
cluster_counts = data['Cluster'].value_counts().reset_index()
cluster_counts.columns = ['Cluster', 'Counts'] # Rename columns for clarity
print(cluster_counts)
# Visualize the clusters
plt.figure(figsize=(10, 8))
plt.scatter(data['LON'], data['LAT'], c=data['Cluster'], cmap='viridis', alpha=0.5)
plt.title('Storm Clusters Based on Latitude and Longitude (Mini-Batch K-Means)')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.colorbar(label='Cluster')
plt.show()
# Analyze the characteristics of each cluster
cluster_analysis = data.groupby('Cluster').mean()
print(f"Cluster Analysis after EM and SS:")
print(cluster_analysis)
# Explore how homogeneous each cluster is.
cluster_variances = data.groupby('Cluster').var()
print(f"Variances of Clusters after EM and SS:")
print(cluster_variances)
cluster_counts = data.groupby("Cluster").size()
print(f"Cluster Counts After EM and SS:")
print(cluster_counts)
# For a specific feature, plot its distribution across clusters
for column in data.columns:
plt.figure(figsize=(8, 6))
sns.boxplot(x='Cluster', y=column, data=data)
plt.title(f'Distribution of {column} across clusters')
plt.show()
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
# Train a decision tree classifier
clf = DecisionTreeClassifier()
clf.fit(data, mb_kmeans.labels_)
# Plot the tree
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=data.columns, filled=True)
plt.show()
# Get feature importances
feature_importances = clf.feature_importances_
# Create a DataFrame to show feature names and their importance
feature_importance_df = pd.DataFrame({
'Feature': data.columns,
'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)
# Print feature importance
print(feature_importance_df)
# Visualize feature importances
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['Feature'], feature_importance_df['Importance'], color='skyblue')
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.title("Feature Importances for Predicting Cluster Labels")
plt.gca().invert_yaxis()
plt.show()
from sklearn.ensemble import RandomForestClassifier
# Train a random forest classifier
rf = RandomForestClassifier()
rf.fit(data, mb_kmeans.labels_)
# Get feature importances
importances = rf.feature_importances_
sorted_indices = np.argsort(importances)[::-1]
# Plot feature importances
plt.figure(figsize=(10, 6))
plt.title("Feature Importance (Random Forest)")
plt.bar(range(data.shape[1]), importances[sorted_indices], align="center")
plt.xticks(range(data.shape[1]), np.array(data.columns)[sorted_indices], rotation=90)
plt.tight_layout()
plt.show()
Bounds for Each Cluster: Cluster wind_speed_min wind_speed_max pressure_min pressure_max 0 0 30 150 0 0 1 1 10 70 966 1014 2 2 0 120 0 1021 3 3 0 55 0 0 4 4 0 100 0 1014 5 5 70 165 0 990 Cluster Counts 0 3 8438 1 2 4814 2 1 4261 3 4 3288 4 0 2541 5 5 1743
Cluster Analysis after EM and SS:
USA_WIND USA_PRES LAT LON
Cluster
0 75.591893 0.000000 9.133184 124.719646
1 34.062661 997.228820 17.420854 128.443602
2 41.525343 900.859576 13.580511 -123.049342
3 15.509125 0.000000 1.858607 118.171410
4 39.358273 991.329075 -16.527248 101.829419
5 103.876649 941.353414 12.191733 97.574819
Variances of Clusters after EM and SS:
USA_WIND USA_PRES LAT LON
Cluster
0 659.093619 0.000000 378.297264 1251.944570
1 210.182223 94.176033 66.615281 920.150782
2 463.094620 88182.055490 123.553808 433.829394
3 223.487590 0.000000 463.165816 2287.627721
4 331.382402 2555.716441 34.569747 2022.387449
5 423.557107 4423.740699 234.885437 6802.055453
Cluster Counts After EM and SS:
Cluster
0 2541
1 4261
2 4814
3 8438
4 3288
5 1743
dtype: int64
Feature Importance 4 Cluster 0.735529 1 USA_PRES 0.264067 2 LAT 0.000404 0 USA_WIND 0.000000 3 LON 0.000000
Counts for Hurricane Categories¶
import pandas as pd
# Define a function to categorize based on the Saffir-Simpson Hurricane Wind Scale (in knots)
def categorize_hurricane(wind_speed):
if 64 <= wind_speed <= 82:
return 'Category 1'
elif 83 <= wind_speed <= 95:
return 'Category 2'
elif 96 <= wind_speed <= 112:
return 'Category 3'
elif 113 <= wind_speed <= 136:
return 'Category 4'
elif wind_speed >= 137:
return 'Category 5'
else:
return 'Tropical Storm or Lower'
modern_hurricanes_track_unique = modern_hurricanes_track_unique.copy()
# Apply the categorization function using .loc
modern_hurricanes_track_unique['Category'] = modern_hurricanes_track_unique.loc[:, 'USA_WIND'].apply(categorize_hurricane)
# Group by year and category to count occurrences
category_counts = modern_hurricanes_track_unique.groupby(['year', 'Category']).size().unstack(fill_value=0)
# Display the counts
print(category_counts)
Category Category 1 Category 2 Category 3 Category 4 Category 5 \ year 1989 88 30 25 24 3 1990 88 48 29 26 3 1991 70 44 35 27 4 1992 97 59 34 45 5 1993 66 44 19 21 0 1994 82 41 36 40 6 1995 51 26 22 14 6 1996 77 39 21 30 4 1997 93 40 30 33 21 1998 61 37 14 17 5 1999 37 23 17 15 1 2000 67 26 18 15 2 2001 70 33 19 13 2 2002 55 40 22 36 7 2003 60 28 20 29 3 2004 62 26 35 35 8 2005 51 30 34 22 5 2006 58 30 22 28 6 2007 52 18 14 18 2 2008 49 23 16 12 0 2009 41 22 19 20 7 2010 33 13 16 8 4 2011 44 24 15 15 1 2012 48 31 24 21 2 2013 56 30 10 17 6 2014 55 29 22 19 9 2015 73 47 41 39 8 2016 59 36 18 18 8 2017 41 28 5 9 0 2018 63 42 41 36 12 2019 66 36 22 27 3 2020 38 16 10 13 2 2021 36 17 12 18 5 2022 38 19 15 11 2 Category Tropical Storm or Lower year 1989 682 1990 728 1991 634 1992 758 1993 666 1994 805 1995 560 1996 779 1997 829 1998 555 1999 508 2000 606 2001 504 2002 510 2003 615 2004 568 2005 523 2006 546 2007 513 2008 574 2009 629 2010 412 2011 500 2012 544 2013 583 2014 610 2015 663 2016 509 2017 515 2018 675 2019 641 2020 553 2021 609 2022 394
unique_names = sorted(modern_hurricanes_track_unique['BASIN'].unique())
print(unique_names)
['EP', 'NI', 'SA', 'SI', 'SP', 'WP']
Interestingly, no observation of cases to have a decent sample set for the atltantic.
Mann-Whitney Test for the Different Observed Basins¶
Concerns identifying any difference in basins for storms for years 1989 to 2022.
import pandas as pd
from scipy.stats import mannwhitneyu
import itertools
# Unique BASIN instances
basins = ['EP', 'NI', 'SA', 'SI', 'SP', 'WP']
# Function to perform Mann-Whitney U Test
def mann_whitney_test(df, column, basin1, basin2):
# Filter the data for each basin
data1 = modern_hurricanes_track_unique[modern_hurricanes_track_unique['BASIN'] == basin1][column]
data2 = modern_hurricanes_track_unique[modern_hurricanes_track_unique['BASIN'] == basin2][column]
# Check if both groups have data
if len(data1) > 0 and len(data2) > 0:
# Perform the Mann-Whitney U test
stat, p_value = mannwhitneyu(data1, data2, alternative='two-sided')
mean_diff = data1.mean() - data2.mean()
median_diff = data1.median() - data2.median()
greater_mean = basin1 if mean_diff > 0 else basin2
greater_median = basin1 if median_diff > 0 else basin2
return stat, p_value, mean_diff, median_diff, greater_mean, greater_median
else:
# Return None for each expected output if one of the groups is empty
return None, None, None, None, None, None
# Store results in lists and create DataFrames later
results_wind = []
results_pres = []
# Perform Mann-Whitney U Test for each combination of BASIN pairs
for basin1, basin2 in itertools.combinations(basins, 2):
# Test for USA_WIND
result_wind = mann_whitney_test(modern_hurricanes_track_unique, 'USA_WIND', basin1, basin2)
if result_wind[0] is not None: # Check if the test was valid (not None)
stat_wind, p_value_wind, mean_diff_wind, median_diff_wind, greater_mean_wind, greater_median_wind = result_wind
results_wind.append({
'BASIN1': basin1,
'BASIN2': basin2,
'Statistic': stat_wind,
'P_Value': p_value_wind,
'Mean_Difference': mean_diff_wind,
'Median_Difference': median_diff_wind,
'Greater_Mean': greater_mean_wind,
'Greater_Median': greater_median_wind
})
# Test for USA_PRES
result_pres = mann_whitney_test(modern_hurricanes_track_unique, 'USA_PRES', basin1, basin2)
if result_pres[0] is not None: # Check if the test was valid (not None)
stat_pres, p_value_pres, mean_diff_pres, median_diff_pres, greater_mean_pres, greater_median_pres = result_pres
results_pres.append({
'BASIN1': basin1,
'BASIN2': basin2,
'Statistic': stat_pres,
'P_Value': p_value_pres,
'Mean_Difference': mean_diff_pres,
'Median_Difference': median_diff_pres,
'Greater_Mean': greater_mean_pres,
'Greater_Median': greater_median_pres
})
# Convert lists to DataFrames
results_wind_df = pd.DataFrame(results_wind)
results_pres_df = pd.DataFrame(results_pres)
# Display results
print("Mann-Whitney U Test Results for USA_WIND:")
print(results_wind_df)
print("\nMann-Whitney U Test Results for USA_PRES:")
print(results_pres_df)
Mann-Whitney U Test Results for USA_WIND:
BASIN1 BASIN2 Statistic P_Value Mean_Difference \
0 EP NI 4766626.5 1.076687e-87 14.322577
1 EP SA 39912.0 1.076877e-01 13.984961
2 EP SI 17981544.5 1.722204e-92 9.478192
3 EP SP 8876503.0 3.805687e-64 9.685972
4 EP WP 27700667.0 8.222761e-57 3.853414
5 NI SA 8079.5 3.614357e-01 -0.337616
6 NI SI 3991975.0 3.357271e-07 -4.844385
7 NI SP 1984075.5 1.668213e-05 -4.636605
8 NI WP 5998358.0 5.722978e-22 -10.469163
9 SA SI 39799.5 8.975860e-01 -4.506769
10 SA SP 19586.5 9.117688e-01 -4.298989
11 SA WP 61049.0 8.105968e-01 -10.131547
12 SI SP 8928089.5 6.932966e-01 0.207780
13 SI WP 27171336.5 1.386588e-14 -5.624778
14 SP WP 13317750.0 6.151791e-11 -5.832558
Median_Difference Greater_Mean Greater_Median
0 10.0 EP EP
1 5.0 EP EP
2 5.0 EP EP
3 5.0 EP EP
4 5.0 EP EP
5 -5.0 SA SA
6 -5.0 SI SI
7 -5.0 SP SP
8 -5.0 WP WP
9 0.0 SI SI
10 0.0 SP SP
11 0.0 WP WP
12 0.0 SI SP
13 0.0 WP WP
14 0.0 WP WP
Mann-Whitney U Test Results for USA_PRES:
BASIN1 BASIN2 Statistic P_Value Mean_Difference \
0 EP NI 5455091.5 1.859533e-211 420.903597
1 EP SA 26792.0 3.266775e-01 -79.629684
2 EP SI 23872854.5 0.000000e+00 487.792406
3 EP SP 12098287.0 0.000000e+00 536.578276
4 EP WP 38617904.0 0.000000e+00 466.059056
5 NI SA 2789.5 3.071658e-06 -500.533280
6 NI SI 4785579.5 4.241358e-10 66.888809
7 NI SP 2476323.5 3.001844e-19 115.674679
8 NI WP 7699261.5 2.254912e-08 45.155459
9 SA SI 69987.5 4.917170e-08 567.422090
10 SA SP 35237.5 4.363927e-09 616.207960
11 SA WP 112926.0 1.275861e-07 545.688740
12 SI SP 9377982.0 1.774320e-06 48.785870
13 SI WP 28929909.0 1.473481e-01 -21.733350
14 SP WP 13458892.5 3.829466e-10 -70.519220
Median_Difference Greater_Mean Greater_Median
0 74.0 EP EP
1 0.0 SA SA
2 1003.0 EP EP
3 1003.0 EP EP
4 1003.0 EP EP
5 -74.0 SA SA
6 929.0 NI NI
7 929.0 NI NI
8 929.0 NI NI
9 1003.0 SA SA
10 1003.0 SA SA
11 1003.0 SA SA
12 0.0 SI SP
13 0.0 WP WP
14 0.0 WP WP
For the print outs of the Greater_Mean and Greater_Median such are just the declaration which basin has the larger value. Say, for USA_WIND, the case of EP versus NI, both the mean and median for EP are larger than those for NI, so EP shows in the first row in both cases for print out. Likewise for USA_PRES.
Hurricane Analysis for the Atlantic Basin¶
After numerous searches for access to hurricane history for the Atlantic basin, data was acquired from a Kaggle repository. However, such data history terminates at year 2015. Such is at least a 9 year gap in modern data. However, for the case of New York, such is still meaningful, since the last hurricane (remnant) to influence New York was IDA.
The National Hurricane Center (NHC) conducts a post-storm analysis of each tropical cyclone in the Atlantic basin (i.e., North Atlantic Ocean, Gulf of Mexico, and Caribbean Sea) and and the North Pacific Ocean to determine the official assessment of the cyclone's history. This analysis makes use of all available observations, including those that may not have been available in real time. In addition, NHC conducts ongoing reviews of any retrospective tropical cyclone analyses brought to its attention and on a regular basis updates the historical record to reflect changes introduced.
To now commence with the data assimilation, data wrangling and feature engineering:
import numpy as np
import pandas as pd
import kagglehub
# Download the dataset
path = kagglehub.dataset_download("noaa/hurricane-database")
print("Path to dataset files:", path)
Warning: Looks like you're using an outdated `kagglehub` version, please consider updating (latest version: 0.3.12) Path to dataset files: C:\Users\verlene\.cache\kagglehub\datasets\noaa\hurricane-database\versions\1
import pandas as pd
import os
# Specify the directory path
dir_path = r"C:\Users\verlene\.cache\kagglehub\datasets\noaa\hurricane-database\versions\1"
# Load 'atlantic.csv' or 'pacific.csv' into a pandas DataFrame
atlantic_path = os.path.join(dir_path, "atlantic.csv")
# Load the Atlantic data
df_atlantic = pd.read_csv(atlantic_path)
print("Atlantic Data:")
print(df_atlantic.head())
Atlantic Data:
ID Name Date Time Event Status Latitude \
0 AL011851 UNNAMED 18510625 0 HU 28.0N
1 AL011851 UNNAMED 18510625 600 HU 28.0N
2 AL011851 UNNAMED 18510625 1200 HU 28.0N
3 AL011851 UNNAMED 18510625 1800 HU 28.1N
4 AL011851 UNNAMED 18510625 2100 L HU 28.2N
Longitude Maximum Wind Minimum Pressure ... Low Wind SW Low Wind NW \
0 94.8W 80 -999 ... -999 -999
1 95.4W 80 -999 ... -999 -999
2 96.0W 80 -999 ... -999 -999
3 96.5W 80 -999 ... -999 -999
4 96.8W 80 -999 ... -999 -999
Moderate Wind NE Moderate Wind SE Moderate Wind SW Moderate Wind NW \
0 -999 -999 -999 -999
1 -999 -999 -999 -999
2 -999 -999 -999 -999
3 -999 -999 -999 -999
4 -999 -999 -999 -999
High Wind NE High Wind SE High Wind SW High Wind NW
0 -999 -999 -999 -999
1 -999 -999 -999 -999
2 -999 -999 -999 -999
3 -999 -999 -999 -999
4 -999 -999 -999 -999
[5 rows x 22 columns]
# Drop rows where 'Minimum Pressure' equals -999
df_atlantic = df_atlantic[df_atlantic['Minimum Pressure'] != -999]
# Check the updated data
print("Atlantic Data (after dropping -999 values):")
print(df_atlantic)
Atlantic Data (after dropping -999 values):
ID Name Date Time Event Status Latitude \
127 AL011852 UNNAMED 18520826 600 L HU 30.2N
252 AL031853 UNNAMED 18530903 1200 HU 19.7N
346 AL031854 UNNAMED 18540907 1200 HU 28.0N
351 AL031854 UNNAMED 18540908 1800 HU 31.6N
352 AL031854 UNNAMED 18540908 2000 L HU 31.7N
... ... ... ... ... ... ... ...
49100 AL122015 KATE 20151112 1200 EX 41.3N
49101 AL122015 KATE 20151112 1800 EX 41.9N
49102 AL122015 KATE 20151113 0 EX 41.5N
49103 AL122015 KATE 20151113 600 EX 40.8N
49104 AL122015 KATE 20151113 1200 EX 40.7N
Longitude Maximum Wind Minimum Pressure ... Low Wind SW \
127 88.6W 100 961 ... -999
252 56.2W 130 924 ... -999
346 78.6W 110 938 ... -999
351 81.1W 100 950 ... -999
352 81.1W 100 950 ... -999
... ... ... ... ... ...
49100 50.4W 55 981 ... 180
49101 49.9W 55 983 ... 180
49102 49.2W 50 985 ... 200
49103 47.5W 45 985 ... 180
49104 45.4W 45 987 ... 150
Low Wind NW Moderate Wind NE Moderate Wind SE Moderate Wind SW \
127 -999 -999 -999 -999
252 -999 -999 -999 -999
346 -999 -999 -999 -999
351 -999 -999 -999 -999
352 -999 -999 -999 -999
... ... ... ... ...
49100 120 120 120 60
49101 120 120 120 60
49102 220 120 120 60
49103 220 0 0 0
49104 220 0 0 0
Moderate Wind NW High Wind NE High Wind SE High Wind SW \
127 -999 -999 -999 -999
252 -999 -999 -999 -999
346 -999 -999 -999 -999
351 -999 -999 -999 -999
352 -999 -999 -999 -999
... ... ... ... ...
49100 0 0 0 0
49101 0 0 0 0
49102 0 0 0 0
49103 0 0 0 0
49104 0 0 0 0
High Wind NW
127 -999
252 -999
346 -999
351 -999
352 -999
... ...
49100 0
49101 0
49102 0
49103 0
49104 0
[18436 rows x 22 columns]
df_atlantic = df_atlantic.dropna()
unique_values = df_atlantic['Name'].unique()
unique_values
array([' UNNAMED', ' ABLE',
' BAKER', ' CHARLIE',
' DOG', ' EASY',
' FOX', ' GEORGE',
' ITEM', ' KING',
' LOVE', ' HOW',
' JIG', ' ALICE',
' BARBARA', ' CAROL',
' DOLLY', ' EDNA',
' FLORENCE', ' GAIL',
' HAZEL', ' GILDA',
' CONNIE', ' DIANE',
' EDITH', ' FLORA',
' GLADYS', ' IONE',
' HILDA', ' JANET',
' KATIE', ' ANNA',
' BETSY', ' CARLA',
' DORA', ' ETHEL',
' FLOSSY', ' GRETA',
' AUDREY', ' BERTHA',
' CARRIE', ' DEBBIE',
' ESTHER', ' FRIEDA',
' BECKY', ' CLEO',
' DAISY', ' ELLA',
' FIFI', ' GERDA',
' HELENE', ' ILSA',
' JANICE', ' ARLENE',
' BEULAH', ' CINDY',
' DEBRA', ' GRACIE',
' HANNAH', ' IRENE',
' JUDITH', ' ABBY',
' BRENDA', ' DONNA',
' FRANCES', ' HATTIE',
' JENNY', ' INGA',
' ALMA', ' CELIA',
' GINNY', ' HELENA',
' ISBELL', ' ELENA',
' DOROTHY', ' FAITH',
' HALLIE', ' INEZ',
' LOIS', ' CHLOE',
' DORIA', ' FERN',
' GINGER', ' HEIDI',
' CANDY', ' BLANCHE',
' CAMILLE', ' EVE',
' FRANCELIA', ' HOLLY',
' KARA', ' LAURIE',
' MARTHA', ' FELICE',
' BETH', ' KRISTY',
' LAURA', ' ALPHA',
' AGNES', ' BETTY',
' DAWN', ' DELTA',
' ALFA', ' CHRISTINE',
' DELIA', ' ELLEN',
' FRAN', ' CARMEN',
' ELAINE', ' GERTRUDE',
' AMY', ' CAROLINE',
' DORIS', ' ELOISE',
' FAYE', ' BELLE',
' DOTTIE', ' CANDICE',
' EMMY', ' GLORIA',
' ANITA', ' BABE',
' CLARA', ' EVELYN',
' AMELIA', ' BESS',
' CORA', ' FLOSSIE',
' HOPE', ' IRMA',
' JULIET', ' KENDRA',
' ANA', ' BOB',
' CLAUDETTE', ' DAVID',
' FREDERIC', ' HENRI',
' ALLEN', ' BONNIE',
' CHARLEY', ' GEORGES',
' EARL', ' DANIELLE',
' HERMINE', ' IVAN',
' JEANNE', ' KARL',
' BRET', ' DENNIS',
' EMILY', ' FLOYD',
' GERT', ' HARVEY',
' JOSE', ' KATRINA',
' ALBERTO', ' BERYL',
' CHRIS', ' DEBBY',
' ERNESTO', ' ALICIA',
' BARRY', ' CHANTAL',
' DEAN', ' ARTHUR',
' CESAR', ' DIANA',
' EDOUARD', ' GUSTAV',
' HORTENSE', ' ISIDORE',
' JOSEPHINE', ' KLAUS',
' LILI', ' DANNY',
' FABIAN', ' ISABEL',
' JUAN', ' KATE',
' ANDREW', ' GILBERT',
' ISAAC', ' JOAN',
' KEITH', ' ALLISON',
' ERIN', ' FELIX',
' GABRIELLE', ' HUGO',
' IRIS', ' JERRY',
' KAREN', ' MARCO',
' NANA', ' ERIKA',
' GRACE', ' GORDON',
' HUMBERTO', ' LUIS',
' MARILYN', ' NOEL',
' OPAL', ' PABLO',
' ROXANNE', ' SEBASTIEN',
' TANYA', ' KYLE',
' BILL', ' ALEX',
' LISA', ' MITCH',
' NICOLE', ' LENNY',
' JOYCE', ' LESLIE',
' MICHAEL', ' NADINE',
' LORENZO', ' MICHELLE',
' OLGA', ' CRISTOBAL',
' FAY', ' HANNA',
' LARRY', ' MINDY',
' NICHOLAS', ' ODETTE',
' PETER', ' GASTON',
' MATTHEW', ' OTTO',
' FRANKLIN', ' TEN',
' LEE', ' MARIA',
' NATE', ' OPHELIA',
' PHILIPPE', ' RITA',
' NINETEEN', ' STAN',
' TAMMY', ' TWENTY-TWO',
' VINCE', ' WILMA',
' BETA', ' GAMMA',
' EPSILON', ' ZETA',
' ANDREA', ' INGRID',
' MELISSA', ' FIFTEEN',
' IKE', ' OMAR',
' SIXTEEN', ' PALOMA',
' ONE', ' FRED',
' EIGHT', ' IDA',
' TWO', ' COLIN',
' FIVE', ' FIONA',
' IGOR', ' JULIA',
' PAULA', ' RICHARD',
' SHARY', ' TOMAS',
' DON', ' KATIA',
' RINA', ' SEAN',
' KIRK', ' OSCAR',
' PATTY', ' RAFAEL',
' SANDY', ' TONY',
' DORIAN', ' FERNAND',
' GONZALO', ' NINE',
' JOAQUIN'], dtype=object)
# Strip leading and trailing whitespace from column names
df_atlantic.columns = df_atlantic.columns.str.strip()
df_atlantic.info()
<class 'pandas.core.frame.DataFrame'> Index: 18436 entries, 127 to 49104 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 18436 non-null object 1 Name 18436 non-null object 2 Date 18436 non-null int64 3 Time 18436 non-null int64 4 Event 18436 non-null object 5 Status 18436 non-null object 6 Latitude 18436 non-null object 7 Longitude 18436 non-null object 8 Maximum Wind 18436 non-null int64 9 Minimum Pressure 18436 non-null int64 10 Low Wind NE 18436 non-null int64 11 Low Wind SE 18436 non-null int64 12 Low Wind SW 18436 non-null int64 13 Low Wind NW 18436 non-null int64 14 Moderate Wind NE 18436 non-null int64 15 Moderate Wind SE 18436 non-null int64 16 Moderate Wind SW 18436 non-null int64 17 Moderate Wind NW 18436 non-null int64 18 High Wind NE 18436 non-null int64 19 High Wind SE 18436 non-null int64 20 High Wind SW 18436 non-null int64 21 High Wind NW 18436 non-null int64 dtypes: int64(16), object(6) memory usage: 3.2+ MB
df_atlantic = df_atlantic.copy()
# Ensure 'Date' column is in string format
df_atlantic['Date'] = df_atlantic['Date'].astype(str)
# Convert 'Time' to string and format it correctly
df_atlantic['Time'] = df_atlantic['Time'].astype(str).str.zfill(4) # Ensure 4 digits
# Combine 'Date' and 'Time' columns into a single datetime column
df_atlantic['DateTime'] = pd.to_datetime(df_atlantic['Date'] + ' ' + df_atlantic['Time'].str[:2] + ':' + df_atlantic['Time'].str[2:])
# Drop the 'Date' and 'Time' columns from the Atlantic DataFrame
df_atlantic = df_atlantic.drop(columns=['Date', 'Time'])
# Remove any non-numeric characters (except for digits, '.' and '-') from Latitude and Longitude columns
df_atlantic['Latitude'] = df_atlantic['Latitude'].str.replace(r'[^\d.-]', '', regex=True)
df_atlantic['Longitude'] = df_atlantic['Longitude'].str.replace(r'[^\d.-]', '', regex=True)
# Convert the Longitude column to numeric, coercing errors to NaN
df_atlantic['Longitude'] = pd.to_numeric(df_atlantic['Longitude'], errors='coerce')
# Convert the Latitude column to numeric, coercing errors to NaN
df_atlantic['Latitude'] = pd.to_numeric(df_atlantic['Latitude'], errors='coerce')
# Update the Longitude column to be negative for positive values
df_atlantic['Longitude'] = df_atlantic['Longitude'].apply(lambda x: -abs(x) if x > 0 else x)
df_atlantic.info()
<class 'pandas.core.frame.DataFrame'> Index: 18436 entries, 127 to 49104 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 18436 non-null object 1 Name 18436 non-null object 2 Event 18436 non-null object 3 Status 18436 non-null object 4 Latitude 18436 non-null float64 5 Longitude 18436 non-null float64 6 Maximum Wind 18436 non-null int64 7 Minimum Pressure 18436 non-null int64 8 Low Wind NE 18436 non-null int64 9 Low Wind SE 18436 non-null int64 10 Low Wind SW 18436 non-null int64 11 Low Wind NW 18436 non-null int64 12 Moderate Wind NE 18436 non-null int64 13 Moderate Wind SE 18436 non-null int64 14 Moderate Wind SW 18436 non-null int64 15 Moderate Wind NW 18436 non-null int64 16 High Wind NE 18436 non-null int64 17 High Wind SE 18436 non-null int64 18 High Wind SW 18436 non-null int64 19 High Wind NW 18436 non-null int64 20 DateTime 18436 non-null datetime64[ns] dtypes: datetime64[ns](1), float64(2), int64(14), object(4) memory usage: 3.1+ MB
# Extract the year from the DateTime column
df_atlantic['year'] = df_atlantic['DateTime'].dt.year
df_atlantic['Month'] = df_atlantic['DateTime'].dt.month
df_atlantic['Day'] = df_atlantic['DateTime'].dt.day
# Filter modern hurricanes from 1980 onwards
modern_hurricanes_tracks = df_atlantic[df_atlantic['year'] >= 1980]
modern_hurricanes_tracks.info()
<class 'pandas.core.frame.DataFrame'> Index: 14593 entries, 33704 to 49104 Data columns (total 24 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 14593 non-null object 1 Name 14593 non-null object 2 Event 14593 non-null object 3 Status 14593 non-null object 4 Latitude 14593 non-null float64 5 Longitude 14593 non-null float64 6 Maximum Wind 14593 non-null int64 7 Minimum Pressure 14593 non-null int64 8 Low Wind NE 14593 non-null int64 9 Low Wind SE 14593 non-null int64 10 Low Wind SW 14593 non-null int64 11 Low Wind NW 14593 non-null int64 12 Moderate Wind NE 14593 non-null int64 13 Moderate Wind SE 14593 non-null int64 14 Moderate Wind SW 14593 non-null int64 15 Moderate Wind NW 14593 non-null int64 16 High Wind NE 14593 non-null int64 17 High Wind SE 14593 non-null int64 18 High Wind SW 14593 non-null int64 19 High Wind NW 14593 non-null int64 20 DateTime 14593 non-null datetime64[ns] 21 year 14593 non-null int32 22 Month 14593 non-null int32 23 Day 14593 non-null int32 dtypes: datetime64[ns](1), float64(2), int32(3), int64(14), object(4) memory usage: 2.6+ MB
Applying the Saffir-Simpson Hurricane Wind Scale:
def categorize_hurricane(wind_speed):
if wind_speed >= 64 and wind_speed <= 82:
return 'Category 1'
elif wind_speed >= 83 and wind_speed <= 95:
return 'Category 2'
elif wind_speed >= 96 and wind_speed <= 112:
return 'Category 3'
elif wind_speed >= 113 and wind_speed <= 136:
return 'Category 4'
elif wind_speed > 137:
return 'Category 5'
else:
return 'Not a Hurricane' # For wind speeds below 74 mph
modern_hurricanes_tracks = modern_hurricanes_tracks.copy()
modern_hurricanes_tracks['HurricaneCategory'] = modern_hurricanes_tracks['Maximum Wind'].apply(categorize_hurricane)
# Display the updated DataFrame with the new 'HurricaneCategory' column
print(modern_hurricanes_tracks[['Name', 'Maximum Wind', 'HurricaneCategory']])
Name Maximum Wind HurricaneCategory 33704 ALLEN 30 Not a Hurricane 33705 ALLEN 30 Not a Hurricane 33706 ALLEN 30 Not a Hurricane 33707 ALLEN 30 Not a Hurricane 33708 ALLEN 35 Not a Hurricane ... ... ... ... 49100 KATE 55 Not a Hurricane 49101 KATE 55 Not a Hurricane 49102 KATE 50 Not a Hurricane 49103 KATE 45 Not a Hurricane 49104 KATE 45 Not a Hurricane [14593 rows x 3 columns]
print(modern_hurricanes_tracks['HurricaneCategory'].isna().sum())
print(modern_hurricanes_tracks['HurricaneCategory'].unique())
0 ['Not a Hurricane' 'Category 1' 'Category 2' 'Category 3' 'Category 4' 'Category 5']
modern_hurricanes_tracks.info()
<class 'pandas.core.frame.DataFrame'> Index: 14593 entries, 33704 to 49104 Data columns (total 25 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 14593 non-null object 1 Name 14593 non-null object 2 Event 14593 non-null object 3 Status 14593 non-null object 4 Latitude 14593 non-null float64 5 Longitude 14593 non-null float64 6 Maximum Wind 14593 non-null int64 7 Minimum Pressure 14593 non-null int64 8 Low Wind NE 14593 non-null int64 9 Low Wind SE 14593 non-null int64 10 Low Wind SW 14593 non-null int64 11 Low Wind NW 14593 non-null int64 12 Moderate Wind NE 14593 non-null int64 13 Moderate Wind SE 14593 non-null int64 14 Moderate Wind SW 14593 non-null int64 15 Moderate Wind NW 14593 non-null int64 16 High Wind NE 14593 non-null int64 17 High Wind SE 14593 non-null int64 18 High Wind SW 14593 non-null int64 19 High Wind NW 14593 non-null int64 20 DateTime 14593 non-null datetime64[ns] 21 year 14593 non-null int32 22 Month 14593 non-null int32 23 Day 14593 non-null int32 24 HurricaneCategory 14593 non-null object dtypes: datetime64[ns](1), float64(2), int32(3), int64(14), object(5) memory usage: 2.7+ MB
# Define the mapping for hurricane categories
category_mapping = {
'Not a Hurricane': 0,
'Category 1': 1,
'Category 2': 2,
'Category 3': 3,
'Category 4': 4,
'Category 5': 5
}
# Create a new column for the ordinal coding
modern_hurricanes_tracks['HurricaneCategoryOrdinal'] = modern_hurricanes_tracks['HurricaneCategory'].map(category_mapping)
# Display the updated DataFrame with the new ordinal column
print(modern_hurricanes_tracks[['HurricaneCategory', 'HurricaneCategoryOrdinal']].head(30))
HurricaneCategory HurricaneCategoryOrdinal 33704 Not a Hurricane 0 33705 Not a Hurricane 0 33706 Not a Hurricane 0 33707 Not a Hurricane 0 33708 Not a Hurricane 0 33709 Not a Hurricane 0 33710 Not a Hurricane 0 33711 Not a Hurricane 0 33712 Category 1 1 33713 Category 1 1 33714 Category 1 1 33715 Category 2 2 33716 Category 3 3 33717 Category 4 4 33718 Category 4 4 33719 Category 4 4 33720 Category 5 5 33721 Category 5 5 33722 Category 5 5 33723 Category 5 5 33724 Category 5 5 33725 Category 4 4 33726 Category 4 4 33727 Category 4 4 33728 Category 4 4 33729 Category 5 5 33730 Category 5 5 33731 Category 5 5 33732 Category 5 5 33733 Category 4 4
modern_hurricanes_tracks.info()
<class 'pandas.core.frame.DataFrame'> Index: 14593 entries, 33704 to 49104 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 14593 non-null object 1 Name 14593 non-null object 2 Event 14593 non-null object 3 Status 14593 non-null object 4 Latitude 14593 non-null float64 5 Longitude 14593 non-null float64 6 Maximum Wind 14593 non-null int64 7 Minimum Pressure 14593 non-null int64 8 Low Wind NE 14593 non-null int64 9 Low Wind SE 14593 non-null int64 10 Low Wind SW 14593 non-null int64 11 Low Wind NW 14593 non-null int64 12 Moderate Wind NE 14593 non-null int64 13 Moderate Wind SE 14593 non-null int64 14 Moderate Wind SW 14593 non-null int64 15 Moderate Wind NW 14593 non-null int64 16 High Wind NE 14593 non-null int64 17 High Wind SE 14593 non-null int64 18 High Wind SW 14593 non-null int64 19 High Wind NW 14593 non-null int64 20 DateTime 14593 non-null datetime64[ns] 21 year 14593 non-null int32 22 Month 14593 non-null int32 23 Day 14593 non-null int32 24 HurricaneCategory 14593 non-null object 25 HurricaneCategoryOrdinal 14593 non-null int64 dtypes: datetime64[ns](1), float64(2), int32(3), int64(15), object(5) memory usage: 2.8+ MB
Applying the Mann-Whitney Test for the Different Periods¶
1. Data Filtering:
The DataFrame modern_hurricanes_track is filtered into two periods:
Period 1: 1980 to 1997
Period 2: 1998 to 2015
2. Counting Storms:
A function count_storms is defined to count the number of storms per year in each period using groupby.
3. Mann-Whitney U Test Function:
A function mann_whitney_test is defined to perform the Mann-Whitney U Test and calculate the required statistics (statistic, p-value, mean difference, median difference).
4. Performing the Test:
The test is conducted between the two periods, and results are displayed.
from scipy.stats import mannwhitneyu
import itertools
import pandas as pd
# Assuming modern_hurricanes_track is already defined and has a 'year' column
# Filter the data for the two periods
period_1 = modern_hurricanes_tracks[(modern_hurricanes_tracks['year'] >= 1980) & (modern_hurricanes_tracks['year'] <= 1997)]
period_2 = modern_hurricanes_tracks[(modern_hurricanes_tracks['year'] >= 1998) & (modern_hurricanes_tracks['year'] <= 2015)]
# Create a function to count storms in each period
def count_storms(df):
return df.groupby('year').size()
# Count storms for each period
storm_counts_period_1 = count_storms(period_1)
storm_counts_period_2 = count_storms(period_2)
# Function to perform Mann-Whitney U Test
def mann_whitney_test(data1, data2):
if len(data1) > 0 and len(data2) > 0:
stat, p_value = mannwhitneyu(data1, data2, alternative='two-sided')
mean_diff = data2.mean() - data1.mean()
median_diff = data2.median() - data1.median()
greater_mean = 'Period 2' if mean_diff > 0 else 'Period 1'
greater_median = 'Period 2' if median_diff > 0 else 'Period 1'
return stat, p_value, mean_diff, median_diff, greater_mean, greater_median
else:
return None, None, None, None, None, None
# Perform the Mann-Whitney U Test between the two periods
result = mann_whitney_test(storm_counts_period_1, storm_counts_period_2)
# Check if the test was valid (not None) and display results
if result[0] is not None:
stat, p_value, mean_diff, median_diff, greater_mean, greater_median = result
print("Mann-Whitney U Test Results:")
print(f"Statistic: {stat}")
print(f"P-Value: {p_value}")
print(f"Mean Difference: {mean_diff}")
print(f"Median Difference: {median_diff}")
print(f"Greater Mean: {greater_mean}")
print(f"Greater Median: {greater_median}")
else:
print("One of the groups is empty, unable to perform the Mann-Whitney U Test.")
Mann-Whitney U Test Results: Statistic: 52.0 P-Value: 0.0005313629584409026 Mean Difference: 183.38888888888886 Median Difference: 176.5 Greater Mean: Period 2 Greater Median: Period 2
Clustering based on the the Saffir-Simpson Hurricane Wind Scale (in knots)¶
hurricane_names = ['GLORIA', 'ANDREW', 'FELIX', 'LUIS', 'OPAL',
'BERTHA', 'DANNY', 'FLOYD', 'GORON',
'ISODORE', 'ISABEL', 'ALEX', 'CHARLEY',
'GASTON', 'FRANCES', 'CINDY', 'KATRINA',
'ERNESTO', 'HANNA', 'BILL',
'IRENE', 'SANDY', 'ARTHUR', 'MATTHEW', 'GERT',
'DORIAN', 'LAURA', 'DELTA', 'ELISA', 'HENRI', 'IDA',
'FRANKLIN', 'LEE', 'ERNESTO', 'IRMA', 'GEORGES', 'MARILYN', 'HUGO',
'BERYL', 'TAMMY', 'PHILIPPE', 'FRANKLIN', 'BRET', 'FIONA', 'EARL',
'SAM', 'GRACE', 'ELSA', 'JOSEPHINE', 'ISAIAS']
# Filter the DataFrame for these names
Hurricana_influence_popular = modern_hurricanes_tracks[modern_hurricanes_tracks['Name'].str.strip().isin(hurricane_names)]
import pandas as pd
import matplotlib.pyplot as plt
from kmodes.kprototypes import KPrototypes
from sklearn.metrics import silhouette_score
import folium
import seaborn as sns
# Select the relevant columns for clustering
X_start_coor = Hurricana_influence_popular[['Latitude', 'Longitude', 'HurricaneCategoryOrdinal']].copy()
# Step 2: Run K-Prototypes
cost = []
k_values = range(1, 11)
for k in k_values:
kproto = KPrototypes(n_clusters=k, init='Huang', random_state=42)
clusters = kproto.fit_predict(X_start_coor, categorical=[2]) # Categorical attribute at index 2
cost.append(kproto.cost_)
# Step 3: Plot the Elbow Curve
plt.figure(figsize=(10, 6))
plt.plot(k_values, cost)
plt.xlabel('Number of Clusters')
plt.ylabel('Cost')
plt.title('Elbow Method for K-Prototypes')
plt.xticks(k_values)
plt.grid(True)
plt.show()
# Calculate Silhouette Scores
silhouette_scores = []
for k in k_values[1:]:
kproto = KPrototypes(n_clusters=k, init='Huang', random_state=42)
clusters = kproto.fit_predict(X_start_coor, categorical=[2])
score = silhouette_score(X_start_coor, clusters)
silhouette_scores.append(score)
# Step 4: Plot Silhouette Scores
plt.figure(figsize=(10, 6))
plt.plot(k_values[1:], silhouette_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score Method for K-Prototypes')
plt.xticks(k_values[1:])
plt.grid(True)
plt.show()
# Clustering and Visualization
for optimal_k in [3, 7]:
kproto = KPrototypes(n_clusters=optimal_k, init='Huang', random_state=42)
clusters = kproto.fit_predict(X_start_coor, categorical=[2])
X_start_coor['Cluster'] = clusters.astype(int)
# Plotting with Matplotlib
plt.figure(figsize=(12, 8))
scatter = plt.scatter(X_start_coor['Longitude'], X_start_coor['Latitude'],
c=X_start_coor['Cluster'], cmap='viridis', alpha=0.6, edgecolor='k')
plt.title(f'Cluster Visualization of Storm Events with k={optimal_k}')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.colorbar(scatter, label='Cluster')
plt.grid(True)
plt.show()
# Optional: Interactive Map Visualization using Folium
map_clusters = folium.Map(location=[X_start_coor['Latitude'].mean(),
X_start_coor['Longitude'].mean()], zoom_start=5)
colors = sns.color_palette("viridis", optimal_k).as_hex()
for _, row in X_start_coor.iterrows():
cluster_index = int(row['Cluster'])
folium.CircleMarker(
[row['Latitude'], row['Longitude']],
radius=5,
color=colors[cluster_index],
fill=True,
fill_color=colors[cluster_index],
fill_opacity=0.7,
popup=f"Storm Type: {row['HurricaneCategoryOrdinal']}, Cluster: {row['Cluster']}"
).add_to(map_clusters)
# Display the interactive map (if running in a Jupyter Notebook environment)
display(map_clusters)
Time Series for Hurricanes that Influenced New York and Montserrat from 1980 to 2015¶
NOTE: weak category 1 hurricanes can fluctuate between category 1, tropical storms and tropical depressions with pressure ranging from 1005 mb to 1016 mb. Yet, it's still possible breach such range. Of consequence, the pressure parameter to be set to 1024 to avoid triggering invalid (imaginary) outputs.
# Summary statistics for the 'pressure' column
pressure_summary = Hurricana_influence_popular['Minimum Pressure'].describe()
print(pressure_summary)
count 4220.000000 mean 992.809479 std 19.723971 min 902.000000 25% 985.000000 50% 1000.000000 75% 1007.000000 max 1024.000000 Name: Minimum Pressure, dtype: float64
# Iterate over each unique hurricane and plot individually
for name in Hurricana_influence_popular['Name'].unique():
# Filter data for the specific hurricane
hurricane_data = Hurricana_influence_popular[Hurricana_influence_popular['Name'] == name]
# Plotting
plt.figure(figsize=(10, 6))
sns.lineplot(
x=hurricane_data['Day'],
y=hurricane_data['Maximum Wind'],
marker='o',
label=f'{name} ({hurricane_data["year"].iloc[0]}-{hurricane_data["Month"].iloc[0]})'
)
# Add pressure as circle markers with varying size and color based on magnitude
plt.scatter(
x=hurricane_data['Day'],
y=hurricane_data['Maximum Wind'],
c=hurricane_data['Minimum Pressure'],
s=(1024 - hurricane_data['Minimum Pressure']) * 2, # size based on pressure
cmap='coolwarm',
alpha=0.7,
edgecolor='k'
)
plt.title(f'Hurricane {name}: Wind Speeds with Pressure Indicators')
plt.xlabel('Day of Month')
plt.ylabel('Wind Speed (knots)')
plt.colorbar(label='Pressure (millibars)')
plt.legend()
plt.show()
Such prior plots are like "fingerprints" for hurricanes, which can only be acquired from pre-existing data.
Influential Meteorological Phenomena¶
Scales of meteorological phenomena are based on their size (horizontal extent) and duration:
Microscale < 2 km - Seconds to minutes - Tornadoes, gusts, turbulence
Mesoscale 2 – 200 km - Minutes to hours - Thunderstorms, squall lines, sea breezes
Synoptic 200 – 2000+ km - Days to a week - Hurricanes, mid-latitude cyclones, cold fronts
Planetary > 2000 km - Weeks to months - Jet streams, Rossby waves
Mesoscale Meteorological Phenomena:
Thunderstorms (singular, multicell and supercell storms)
Tornados (despite being a small-scale phenomena, they arise in mesoscale environments such as supercell thunderstorms)
Fronts (can influence local weather patterns, especially where larger systems interact)
Sea Breezes and Land Breezes (local wind systems influenced by differential heating of land and water are typical mesoscale phenomena)
Orographic Lifting (being the impact of terrains on wind patterns and preciptation that can lead to mesoscale events)
Squall Lines (long lines of thunderstorms associated with cold fronts)
Drylines (boundaries between different air masses, particularly warms, mosit air and hot dry air, often leading to storm formation)
Importance of Mesoscale Analysis
comprehending mesoscale process is vital for weather forecasting, especially for forecasting severe weather events like thundertorms, hail. tornadoes, and flash flooding.
Classifying Extreme Weather Events¶
For extreme weather events one needs to identify the conditions for such. For a purely tropical ambiance in particular, the attributes of interest are temperature, air pressure (concerning tropical depressions, tropical storms or hurricanes), rainfall level, windspeed nad wind gusts.
To now identify the key parameters for each attribute:
- In celsius measure 35°C is generally considered quite hot.
- In celsius -15°C degrees is considered quite cold.
- The highest air pressure, translating to the weakness recognised tropical cyclones, disturbances or waves, is 1016 hPa; such is generally the upper limit for recognised weakest systems.
- For rain fall torrential downpours of at least 30 mm within a hour is considered an extreme event;
- For wind speed, at least 70 km/h is considered hazardous; same for wind gusts.
To identify an extreme event in a data point or event (out of 1 through 5 prior), one checks if the current value is extreme compared to recent history (usually a rolling window). Hence, one is tagging the observation with a numeric code representing the type of outlier.
FURTHER CLARIFICATIONS --
- If current temperature at 2 meters is ≥ 35°C (heatwave threshold), and the recent rolling average is < 35°C, then it’s considered a sudden spike, not a gradual warming trend.
- If current temp is ≤ -15°C (severe cold), but the average isn’t that low — Extreme cold.
- Mean sea level pressure has dropped below 1016 hPa, possibly indicating a low-pressure system (e.g., storm), but if the average pressure was higher, this drop is sudden — Possible threatening weather system.
- Very high rainfall event (≥ 30 mm) within an hour. Not part of a rainy trend → it's an anomalous downpour — Extreme rainfall.
- For wind speed or wind gust, if either of them hit 70 km/h or higher, and this isn’t typical in the rolling window — Extreme wind or gusts.
import openmeteo_requests
import pandas as pd
import requests_cache
from retry_requests import retry
# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)
# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
"latitude": 16.7425,
"longitude": -62.1874,
"start_date": "2022-01-08",
"end_date": "2025-06-24",
"hourly": ["temperature_2m", "rain", "wind_speed_10m", "wind_speed_100m", "wind_gusts_10m", "pressure_msl"],
"timezone": "auto"
}
responses = openmeteo.weather_api(url, params=params)
# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()}{response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")
# Process hourly data. The order of variables needs to be the same as requested.
hourly = response.Hourly()
hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
hourly_rain = hourly.Variables(1).ValuesAsNumpy()
hourly_wind_speed_10m = hourly.Variables(2).ValuesAsNumpy()
hourly_wind_speed_100m = hourly.Variables(3).ValuesAsNumpy()
hourly_wind_gusts_10m = hourly.Variables(4).ValuesAsNumpy()
hourly_pressure_msl = hourly.Variables(5).ValuesAsNumpy()
hourly_data = {"date": pd.date_range(
start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
end = pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
freq = pd.Timedelta(seconds = hourly.Interval()),
inclusive = "left"
)}
hourly_data["temperature_2m"] = hourly_temperature_2m
hourly_data["rain"] = hourly_rain
hourly_data["wind_speed_10m"] = hourly_wind_speed_10m
hourly_data["wind_speed_100m"] = hourly_wind_speed_100m
hourly_data["wind_gusts_10m"] = hourly_wind_gusts_10m
hourly_data["pressure_msl"] = hourly_pressure_msl
hourly_dataframe_extreme = pd.DataFrame(data = hourly_data)
print(hourly_dataframe_extreme)
Coordinates 16.76625633239746°N -62.20843505859375°E
Elevation 309.0 m asl
Timezone b'America/Montserrat'b'GMT-4'
Timezone difference to GMT+0 -14400 s
date temperature_2m rain wind_speed_10m \
0 2022-01-08 04:00:00+00:00 23.249001 0.0 28.146843
1 2022-01-08 05:00:00+00:00 22.598999 0.0 27.255590
2 2022-01-08 06:00:00+00:00 22.348999 0.0 30.498180
3 2022-01-08 07:00:00+00:00 21.848999 0.1 28.241076
4 2022-01-08 08:00:00+00:00 22.098999 0.1 29.215502
... ... ... ... ...
30331 2025-06-24 23:00:00+00:00 NaN NaN NaN
30332 2025-06-25 00:00:00+00:00 NaN NaN NaN
30333 2025-06-25 01:00:00+00:00 NaN NaN NaN
30334 2025-06-25 02:00:00+00:00 NaN NaN NaN
30335 2025-06-25 03:00:00+00:00 NaN NaN NaN
wind_speed_100m wind_gusts_10m pressure_msl
0 34.634918 56.160000 1018.500000
1 33.466450 57.599998 1018.299988
2 36.707645 60.120003 1017.599976
3 34.743263 62.279995 1017.500000
4 35.565376 58.679996 1017.400024
... ... ... ...
30331 NaN NaN NaN
30332 NaN NaN NaN
30333 NaN NaN NaN
30334 NaN NaN NaN
30335 NaN NaN NaN
[30336 rows x 7 columns]
hourly_dataframe_extreme_clean = hourly_dataframe_extreme.dropna()
hourly_dataframe_extreme_clean.info()
<class 'pandas.core.frame.DataFrame'> Index: 30309 entries, 0 to 30308 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 30309 non-null datetime64[ns, UTC] 1 temperature_2m 30309 non-null float32 2 rain 30309 non-null float32 3 wind_speed_10m 30309 non-null float32 4 wind_speed_100m 30309 non-null float32 5 wind_gusts_10m 30309 non-null float32 6 pressure_msl 30309 non-null float32 dtypes: datetime64[ns, UTC](1), float32(6) memory usage: 1.2 MB
hourly_dataframe_extreme_clean.isna().sum()
date 0 temperature_2m 0 rain 0 wind_speed_10m 0 wind_speed_100m 0 wind_gusts_10m 0 pressure_msl 0 dtype: int64
# Define the rolling window size
window_size = 30
# Initialize lists to hold outlier types and rule descriptions
outlier_types = []
outlier_rules = []
# Define the rules for each outlier code
outlier_descriptions = {
0: "Not an extreme event or insufficient data",
1: "Extreme heat: temperature_2m ≥ 35°C and 30-day mean < 35°C",
2: "Extreme cold: temperature_2m ≤ -15°C and 30-day mean > -15°C",
3: "Possible threatening weather system: pressure_msl ≤ 1016 hPa and 30-day mean > 1016 hPa",
4: "Extreme rainfall: rain ≥ 30 mm and 30-day mean < 30 mm",
5: "Extreme wind speed (10m): wind_speed_10m ≥ 70 km/h and 30-day mean < 70 km/h",
6: "Extreme wind speed (100m): wind_speed_100m ≥ 70 km/h and 30-day mean < 70 km/h",
7: "Extreme wind gusts: wind_gusts_10m ≥ 70 km/h and 30-day mean < 70 km/h"
}
# Iterate through each row to label outliers based on rule
for index, row in hourly_dataframe_extreme_clean.iterrows():
if index >= window_size - 1:
temp_window = hourly_dataframe_extreme_clean['temperature_2m'][index - window_size + 1:index + 1]
pressure_window = hourly_dataframe_extreme_clean['pressure_msl'][index - window_size + 1:index + 1]
rainfall_window = hourly_dataframe_extreme_clean['rain'][index - window_size + 1:index + 1]
wind_speed_10m_window = hourly_dataframe_extreme_clean['wind_speed_10m'][index - window_size + 1:index + 1]
wind_speed_100m_window = hourly_dataframe_extreme_clean['wind_speed_100m'][index - window_size + 1:index + 1]
wind_gusts_10m_window = hourly_dataframe_extreme_clean['wind_gusts_10m'][index - window_size + 1:index + 1]
if row['temperature_2m'] >= 35 and (temp_window.mean() < 35):
code = 1
elif row['temperature_2m'] <= -15 and (temp_window.mean() > -15):
code = 2
elif row['pressure_msl'] <= 1016 and (pressure_window.mean() > 1016):
code = 3
elif row['rain'] >= 30 and (rainfall_window.mean() < 30):
code = 4
elif row['wind_speed_10m'] >= 70 and (wind_speed_10m_window.mean() < 70):
code = 5
elif row['wind_speed_100m'] >= 70 and (wind_speed_100m_window.mean() < 70):
code = 6
elif row['wind_gusts_10m'] >= 70 and (wind_gusts_10m_window.mean() < 70):
code = 7
else:
code = 0
else:
code = 0 # Not enough data for comparison
outlier_types.append(code)
outlier_rules.append(outlier_descriptions[code])
# Assign new columns safely with .loc
hourly_dataframe_extreme_clean = hourly_dataframe_extreme_clean.copy()
hourly_dataframe_extreme_clean.loc[:, 'outlier_type'] = outlier_types
hourly_dataframe_extreme_clean.loc[:, 'outlier_rule'] = outlier_rules
# Print unique rules (excluding "not extreme")
unique_rules = set(outlier_rules)
print("Unique extreme event rules detected:")
for rule in unique_rules:
if rule != outlier_descriptions[0]:
print(f"- {rule}")
# Filter extreme events only
extreme_events = hourly_dataframe_extreme_clean[hourly_dataframe_extreme_clean['outlier_type'] != 0].copy()
# Convert timezone-aware datetime to naive and then to float seconds since epoch
time_column = hourly_dataframe_extreme_clean['date'].dt.tz_localize(None).astype('int64') / 1e9
# Adjust time_column safely
time_column.loc[time_column <= 0] += 1e-6
# Assign event observed column safely
hourly_dataframe_extreme_clean.loc[:, 'event_observed'] = hourly_dataframe_extreme_clean['outlier_type'].apply(lambda x: 1 if x != 0 else 0)
# Print the updated DataFrame
print(hourly_dataframe_extreme_clean)
Unique extreme event rules detected:
- Possible threatening weather system: pressure_msl ≤ 1016 hPa and 30-day mean > 1016 hPa
- Extreme wind speed (100m): wind_speed_100m ≥ 70 km/h and 30-day mean < 70 km/h
- Extreme wind gusts: wind_gusts_10m ≥ 70 km/h and 30-day mean < 70 km/h
date temperature_2m rain wind_speed_10m \
0 2022-01-08 04:00:00+00:00 23.249001 0.0 28.146843
1 2022-01-08 05:00:00+00:00 22.598999 0.0 27.255590
2 2022-01-08 06:00:00+00:00 22.348999 0.0 30.498180
3 2022-01-08 07:00:00+00:00 21.848999 0.1 28.241076
4 2022-01-08 08:00:00+00:00 22.098999 0.1 29.215502
... ... ... ... ...
30304 2025-06-23 20:00:00+00:00 25.449001 0.0 41.403522
30305 2025-06-23 21:00:00+00:00 25.648998 0.0 40.892101
30306 2025-06-23 22:00:00+00:00 25.799000 0.0 42.026817
30307 2025-06-23 23:00:00+00:00 25.098999 0.0 42.705925
30308 2025-06-24 00:00:00+00:00 25.549000 0.1 41.760387
wind_speed_100m wind_gusts_10m pressure_msl outlier_type \
0 34.634918 56.160000 1018.500000 0
1 33.466450 57.599998 1018.299988 0
2 36.707645 60.120003 1017.599976 0
3 34.743263 62.279995 1017.500000 0
4 35.565376 58.679996 1017.400024 0
... ... ... ... ...
30304 46.980347 52.560001 1015.099976 0
30305 46.474869 52.560001 1015.000000 0
30306 47.786861 52.560001 1015.500000 0
30307 48.116932 56.160000 1016.400024 0
30308 47.160343 54.360001 1017.099976 0
outlier_rule event_observed
0 Not an extreme event or insufficient data 0
1 Not an extreme event or insufficient data 0
2 Not an extreme event or insufficient data 0
3 Not an extreme event or insufficient data 0
4 Not an extreme event or insufficient data 0
... ... ...
30304 Not an extreme event or insufficient data 0
30305 Not an extreme event or insufficient data 0
30306 Not an extreme event or insufficient data 0
30307 Not an extreme event or insufficient data 0
30308 Not an extreme event or insufficient data 0
[30309 rows x 10 columns]
Interpretation of the Extreme Events Detected¶
Caution: the time span of the data can be considered small. However, concerning research, hourly data over various years can be quite computationally expensive. Specifically for the Montserrat territory, by consensus its climate is tropical. Hence, cold temperatures observed in temperate and artic climates are highly inplausible. Extreme high temperature trends not being present can be attributed to Montserrat being a very small land mass island in the Caribbean Sea; highly influenced by coastal or oceanic-atmospheric dynamic.
Unique extreme event rules detected:
Possible threatening weather system: pressure_msl ≤ 1016 hPa and 30-day mean > 1016 hPa
Extreme wind speed (100m): wind_speed_100m ≥ 70 km/h and 30-day mean < 70 km/h
Extreme wind gusts: wind_gusts_10m ≥ 70 km/h and 30-day mean < 70 km/h
Concerning pressure_msl, pressures at 1016 hPa or below are related to Hurricanes, tropical depressions, tropical storms, tropical waves, etc., etc., etc. Extreme wind speeds and extreme wind gusts are heavliy tied to low pressure systems; yet prior observed (Pearson) correlation measures for wind_speed_10m and wind_speed_100m with pressure_msl yielding (036. 0.39, etc.) are a bit disappointing when negative measure in correlation is expected.
Nevertheless, models in physics exonerate an "inverse" relationship between (atmospheric) pressure and wind speed/gust.
Physics Models Relating Atmospheric Pressure and Wind Speed¶
To model the relationship between low atmospheric pressure and high wind speeds/gusts, several fundamental physics models and equations from atmospheric dynamics and fluid mechanics are relevant. These capture the behavior of air flow in response to pressure gradients and the Earth's rotation. However, each model is relevant for specific settings.
1. Pressure Gradient Force (PGF)¶
Being the primary force responsible for wind. Air naturally moves from high-pressure areas to low-pressure areas due to the pressure gradient force:
$$ \vec{F}_{\text{PGF}} = -\frac{1}{\rho} \nabla P $$$ \vec{F}_{\text{PGF}} $ : Pressure Gradient Force per unit mass (vector)
$ \rho $ : Air density ($\frac{kg}{m^3}$)
$ \nabla P $ : Gradient of pressure (change in pressure over distance)
Air accelerates from high to low pressure; the stronger the pressure gradient, the stronger the resulting force and hence wind.
Such model is generally observed, however, such a model is stronlgy recognised for highly controlled environments like hydraulics, water management in civil engineering, etc.
Geostrophic Wind Equation (Large-Scale, Upper Atmosphere)¶
In large-scale atmospheric flows (away from surface friction), wind tends to balance between the Coriolis force and the pressure gradient force.
$$ \vec{v}_g = \frac{1}{f \rho} \hat{k} \times \nabla P $$$\vec{v}_g$: Geostrophic wind velocity
$f = 2\Omega \sin \phi$: Coriolis parameter (Earth's rotation rate $\Omega$ and latitude $\phi$)
$\hat{k}$: Unit vector in the vertical direction
Cyclostrophic Wind Equation (Small-Scale such as Hurricanes, Tornadoes)¶
Applicable to small-scale, rapidly rotating low-pressure systems (e.g., tornadoes, tropical cyclones) where Coriolis force is negligible.
$$ \frac{v^2}{r} = \frac{1}{\rho} \frac{dP}{dr} $$$v$: Wind speed
$r$: Radius from the center of rotation
$\frac{dP}{dr}$: Radial pressure gradient
Gradient Wind Equation (Curved Flow Around Lows)¶
A generalization of geostrophic and cyclostrophic wind, includes both Coriolis and centripetal forces.
$$ \frac{v^2}{r} + fv = \frac{1}{\rho} \frac{dP}{dr} $$Includes both Coriolis force ($fv$) and centripetal force ($v^2/r$)
Observed is a quadratic equation in $v$ to solve.
Bernoulli’s Principle (Idealized, Steady Flow)¶
In special cases (non-rotating, frictionless, incompressible air), energy conservation applies. In ideal, frictionless, incompressible flow:
$$ \frac{P}{\rho} + \frac{v^2}{2} + gh = \text{constant} $$$P$: Pressure
$v$: Wind speed
$g$: Gravitational acceleration
$h$: Height
Navier-Stokes Equations (Full Atmospheric Motion - Numerical Modelling)¶
The full motion of air parcels includes pressure gradient, Coriolis, and friction forces. To fully simulate wind, especially in numerical weather prediction models:
$$ \frac{D\vec{v}}{Dt} = -\frac{1}{\rho} \nabla P + \vec{F}_c + \vec{F}_{\text{fric}} $$$\frac{D\vec{v}}{Dt}$: Material (total) derivative of velocity
$\vec{F}_c$: Coriolis force
$\vec{F}_{\text{fric}}$: Frictional force
The seemingly most promising or applicable models concerning direct data integretion, are the Cyclostrophic Wind Equation and the Gradient Wind Equation. Observation of the respective parameters or attributes, such two models are highly relatable to common meteoroligical data and data for aggressive weather activity such as tropical waves, tropical depressions, cyclones and tornado events. The identified atmospheric pressure - wind speed relationship can be observed.
Programming to Demonstrate the Credibility of the Gradient Wind Equation and Cyclostrophic Wind Equation¶
To now evaluate how well the gradient wind balance approximates the actual near-surface wind speeds observed in atmospheric data. The gradient wind balance is a physical relationship describing wind flow around curved pressure fields, such as cyclones, incorporating the effects of pressure gradient force, Coriolis force, and centrifugal acceleration.
Will be applying ERA5 data from the Copernicus Programme of the European Union; specifically, the ERA5 hourly data on single levels from 1940 to present. However due to data constrainst will foucus on the 2024 hurricane season (June 1, and officially ended on November 30), centered on the Lesser Antilles of the Caribbean. The Lesser Antilles are located in the eastern and southeastern Caribbean Sea, extending from the Virgin Islands in the north to Trinidad and Tobago in the south. The approximate coordinates are between 10° and 16° N latitude and -60° and -63° W longitude. This region includes the Leeward Islands and Windward Islands. Specification for the 2024 year, because at this time such is the most current and full hurricane season on record; aside from data volume restrictions and intentionally specifying 4 hour intervals rather than 6 hour intervals to reduce loss of "intrinsic" information.
Such data is chosen because there are attributes that are applicable to the gradient wind equation and cyclostrophic wind equation. Namely, vectorial components of wind speed.
By calculating the theoretical gradient wind speeds from pressure, temperature, and wind fields, and comparing them to the observed 10-meter wind speeds, the programming to quantify the accuracy of the gradient wind approximation over time and space. The resulting error metrics (RMSE and bias) help diagnose the dynamical consistency of the dataset and reveal how closely the atmosphere follows gradient wind balance under various conditions.
This analysis is valuable for meteorologists and atmospheric scientists studying wind dynamics, verifying numerical weather models, or investigating cyclone structures.
import xarray as xr
import pandas as pd
# Load the NetCDF file
ds = xr.open_dataset('data_stream-oper_stepType-instant.nc')
# View the structure
print(ds)
# Convert to DataFrame
df = ds.to_dataframe().reset_index()
# Display a preview
print(df.head())
<xarray.Dataset>
Dimensions: (valid_time: 1281, latitude: 25, longitude: 13)
Coordinates:
number int64 ...
* valid_time (valid_time) datetime64[ns] 2024-06-01 ... 2024-11-30T23:00:00
* latitude (latitude) float64 18.0 17.75 17.5 17.25 ... 12.5 12.25 12.0
* longitude (longitude) float64 -63.0 -62.75 -62.5 ... -60.5 -60.25 -60.0
expver (valid_time) object ...
Data variables:
u10 (valid_time, latitude, longitude) float32 ...
v10 (valid_time, latitude, longitude) float32 ...
t2m (valid_time, latitude, longitude) float32 ...
msl (valid_time, latitude, longitude) float32 ...
sst (valid_time, latitude, longitude) float32 ...
sp (valid_time, latitude, longitude) float32 ...
Attributes:
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2025-07-10T17:39 GRIB to CDM+CF via cfgrib-0.9.1...
valid_time latitude longitude number expver u10 v10 \
0 2024-06-01 18.0 -63.00 0 0001 -6.922592 0.909912
1 2024-06-01 18.0 -62.75 0 0001 -6.799545 0.697998
2 2024-06-01 18.0 -62.50 0 0001 -6.703842 0.547607
3 2024-06-01 18.0 -62.25 0 0001 -6.710678 0.451904
4 2024-06-01 18.0 -62.00 0 0001 -6.640366 0.364990
t2m msl sst sp
0 301.883209 101641.75 302.028809 101591.40625
1 301.867584 101650.00 302.030762 101637.40625
2 301.848053 101657.50 302.159668 101665.40625
3 301.850006 101663.75 302.169434 101658.40625
4 301.861725 101669.25 302.138184 101672.40625
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 416325 entries, 0 to 416324 Data columns (total 11 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 valid_time 416325 non-null datetime64[ns] 1 latitude 416325 non-null float64 2 longitude 416325 non-null float64 3 number 416325 non-null int64 4 expver 416325 non-null object 5 u10 416325 non-null float32 6 v10 416325 non-null float32 7 t2m 416325 non-null float32 8 msl 416325 non-null float32 9 sst 409920 non-null float32 10 sp 416325 non-null float32 dtypes: datetime64[ns](1), float32(6), float64(2), int64(1), object(1) memory usage: 25.4+ MB
df
| valid_time | latitude | longitude | number | expver | u10 | v10 | t2m | msl | sst | sp | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-06-01 00:00:00 | 18.0 | -63.00 | 0 | 0001 | -6.922592 | 0.909912 | 301.883209 | 101641.75 | 302.028809 | 101591.40625 |
| 1 | 2024-06-01 00:00:00 | 18.0 | -62.75 | 0 | 0001 | -6.799545 | 0.697998 | 301.867584 | 101650.00 | 302.030762 | 101637.40625 |
| 2 | 2024-06-01 00:00:00 | 18.0 | -62.50 | 0 | 0001 | -6.703842 | 0.547607 | 301.848053 | 101657.50 | 302.159668 | 101665.40625 |
| 3 | 2024-06-01 00:00:00 | 18.0 | -62.25 | 0 | 0001 | -6.710678 | 0.451904 | 301.850006 | 101663.75 | 302.169434 | 101658.40625 |
| 4 | 2024-06-01 00:00:00 | 18.0 | -62.00 | 0 | 0001 | -6.640366 | 0.364990 | 301.861725 | 101669.25 | 302.138184 | 101672.40625 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 416320 | 2024-11-30 23:00:00 | 12.0 | -61.00 | 0 | 0001 | -7.324890 | -0.246597 | 300.952332 | 101233.00 | 302.183105 | 101259.50000 |
| 416321 | 2024-11-30 23:00:00 | 12.0 | -60.75 | 0 | 0001 | -7.378601 | -0.223160 | 301.036316 | 101238.00 | 302.371582 | 101251.50000 |
| 416322 | 2024-11-30 23:00:00 | 12.0 | -60.50 | 0 | 0001 | -7.423523 | -0.317886 | 301.239441 | 101241.00 | 302.478027 | 101247.50000 |
| 416323 | 2024-11-30 23:00:00 | 12.0 | -60.25 | 0 | 0001 | -7.399109 | -0.437027 | 301.368347 | 101242.50 | 302.529785 | 101239.50000 |
| 416324 | 2024-11-30 23:00:00 | 12.0 | -60.00 | 0 | 0001 | -7.374695 | -0.499527 | 301.452332 | 101244.75 | 302.526855 | 101238.50000 |
416325 rows × 11 columns
PRELIMINARY OBSERVATION (with Pearson correlation upon ERA5 data) --
# Applying Pearson correlation to attributes of interest
import seaborn as sns
ERA5_data_corr = df.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(ERA5_data_corr, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation heatmap of Montserrat Daily Meteorological Data')
plt.savefig('daily_heatmap.pdf', format = 'pdf')
plt.show()
Observing the above (Pearson correlation) heatmap for the ERA5 data centered on the hurricane season in the Caribbean, negative correlation is observed between wind speed measures and pressure; stronger negative correlation measure observed for $u_{10}$ (the zonal or east-west component of the wind at 10 meters above the surface) than $v_{10}$ (the meridional or north-south component of the wind at 10 meters above the surface). Nevertheless, the negative correlation measures can reflect the pressure-wind speed relationship concerning cyclones, tropical depresssions, tropical storms, tropical waves, tornadoes, etc.
Diagnostic Visualization of Near-Surface Atmospheric Dynamics Involving Wind Vector fields and Mean sea Level Pressure¶
To now presents a diagnostic visualization of near-surface atmospheric dynamics using 10-meter wind vector fields and mean sea level pressure (MSLP) contours derived from gridded reanalysis data. The wind field is decomposed into its zonal (u) and meridional (v) components, and the wind speed magnitude is computed and expressed in km/h to enhance interpretability. Pressure contours are overlaid to reveal the synoptic-scale structure of pressure systems, enabling the analysis of the pressure gradient force and its influence on wind flow. The visualization reflects fundamental physical balances, particularly the interplay between the pressure gradient force, Coriolis force, and surface friction, which collectively govern wind behavior in the lower troposphere. The result is a spatiotemporal depiction of wind and pressure evolution that provides insight into cyclonic and anticyclonic circulations, wind convergence zones, and regions of strong dynamical forcing. This framework supports both qualitative and quantitative assessments of atmospheric processes central to weather analysis and forecasting.
PHYSICAL MEANING OF PLOTTED VARIABLES --
Wind Components: u10 and v10
These are the zonal (east-west) and meridional (north-south) components of wind at 10 meters above the surface:
u10: wind in the x-direction (positive → eastward)
v10: wind in the y-direction (positive → northward)
They're derived from the momentum equations in the atmosphere and reflect the resultant force balance, typically:
$$\text{Wind = balance between pressure gradient force, Coriolis force, and friction}$$At 10 m, friction is still significant, so wind isn’t in geostrophic balance.
Mean Sea Level Pressure: msl
This is the atmospheric pressure reduced to sea level (in hPa):
Lower MSLP typically corresponds to cyclones (low-pressure centers).
Higher MSLP → anticyclones (high-pressure ridges).
The pressure field is key to understanding the pressure gradient force (PGF), the main driver of wind:
$$\vec{F}_{\mathrm{PG}} = -\frac{1}{\rho} \nabla p$$Wind tends to blow from high to low pressure, deflected by the Coriolis effect.
Wind Speed (Magnitude):
Computed as:
$$|\vec{V}| = \sqrt{u^2 + v^2}$$Converted to km/h using:
$$|\vec{V}_{\mathrm{km/h}}| = \sqrt{u^2 + v^2} \times 3.6$$METEOROLOGICAL INTERPRETATION OF THE PLOT --
Wind Vectors (Arrows):
The direction shows where the wind is going.
The length (and color) reflects the speed.
If arrows circulate counterclockwise around a low, it's consistent with a cyclonic system (in Northern Hemisphere).
MSLP Contours:
Closely spaced contours ⇒ strong pressure gradient ⇒ strong winds
Wind typically blows nearly parallel to the isobars, but slightly across them toward low pressure due to surface friction.
MATH AND PHYSICS IN THE PROGRAMMING --
Gradient Field Visualization (MSLP)
The contour() function plots scalar isopleths of the pressure field:
$$\text{Countours represent constant}\,\,p = MSLP(x,y)$$These help visualize gradients:
$$\nabla p = \left(\frac{\partial p}{\partial x}, \frac{\partial p}{\partial y}\right)$$Steeper gradients → stronger pressure gradient force → stronger wind.
Vector field (Wind Arrows)
Plotting vectors $\vec{V} = (u,v)$ at each grid point using quiver().
This visually solves:abs
$$\vec{V}(x,y) = u(x,y)\mathbf{i} + v(x,y)\mathbf{j}$$And uses color to reflect:
$$|\vec{V}| = \sqrt{u^2 + v^2}$$Wind Key and Colorbar
The quiver key and colorbar provide a scale for interpreting vector length and color:
quiverkey(q, 0.9, -0.1, 36, '36 km/h'): shows that arrow length corresponds to 36 km/h wind
Colorbar maps magnitude to color — helpful for spotting high wind regions.
OVERALL --
This plot helps meteorologists and researchers:
Diagnose pressure systems (e.g., tropical cyclones)
Understand wind responses to pressure gradients
Identify potential hazards (e.g., high winds, storm tracks)
import xarray as xr
import matplotlib.pyplot as plt
import numpy as np
import cartopy.crs as ccrs
import cartopy.feature as cfeature
# Load dataset
ds_wind = xr.open_dataset("data_stream-oper_stepType-instant.nc")
# Coordinates
lon = ds_wind['longitude']
lat = ds_wind['latitude']
lon2d, lat2d = np.meshgrid(lon, lat)
# Loop over first 28 time steps
for time_idx in range(28):
# Extract 10m wind components and pressure
u = ds_wind['u10'].isel(valid_time=time_idx)
v = ds_wind['v10'].isel(valid_time=time_idx)
msl = ds_wind['msl'].isel(valid_time=time_idx) / 100 # Convert Pa to hPa
# Compute wind speed (in km/h)
wind_speed_kmh = np.sqrt(u**2 + v**2) * 3.6
# Set up figure with geographic projection
fig, ax = plt.subplots(figsize=(10, 6), subplot_kw={'projection': ccrs.PlateCarree()})
ax.set_extent([lon.min(), lon.max(), lat.min(), lat.max()], crs=ccrs.PlateCarree())
# Add geographic features
ax.coastlines()
ax.add_feature(cfeature.BORDERS, linestyle=':')
ax.add_feature(cfeature.LAND, edgecolor='black', facecolor='lightgray')
# Plot pressure contours
cs = ax.contour(lon2d, lat2d, msl, levels=20, colors='black', linewidths=0.6)
ax.clabel(cs, inline=True, fontsize=8, fmt='%1.0f hPa')
# Plot wind vectors (skip every 2 to reduce clutter)
skip = (slice(None, None, 2), slice(None, None, 2))
q = ax.quiver(
lon2d[skip], lat2d[skip],
u.values[skip], v.values[skip],
wind_speed_kmh.values[skip], # Color by wind speed
cmap='viridis',
scale=None, # Let quiver auto-scale to show magnitude
pivot='middle', # Arrow pivots in center
width=0.0028,
transform=ccrs.PlateCarree()
)
# Colorbar showing wind speed in km/h
cb = plt.colorbar(q, ax=ax, orientation='vertical', label='Wind Speed (km/h)')
# Quiver key (interpretive arrow)
ax.quiverkey(q, 0.9, -0.1, 36, '36 km/h', labelpos='E', coordinates='axes')
# Reference "+" marker
center_lat = lat.values[len(lat) // 2]
center_lon = lon.values[len(lon) // 2]
ax.plot(center_lon, center_lat, marker='+', color='red', markersize=12, transform=ccrs.PlateCarree())
ax.text(center_lon + 0.2, center_lat, 'Reference Point', color='red', fontsize=9, transform=ccrs.PlateCarree())
# Title with timestamp
time_str = np.datetime_as_string(ds_wind['valid_time'][time_idx].values, unit='h')
ax.set_title(f"10m Wind Vectors and MSLP (km/h)\n{time_str}")
plt.tight_layout()
plt.show()
Time-Resolved Comparison Between Observed Surface Wind Speed (V_obs) and a Physically Diagnosed Wind Speed¶
The succeeding code implements a time-resolved comparison between observed surface wind speed (V_obs) and a physically diagnosed wind speed (V_gradient) derived from the cyclostrophic-gradient wind equation using reanalysis data. Below is a step-by-step mathematical interpretation of the algorithm:
The programming uses meteorological variables (10m wind, mean sea level pressure, 2m temperature) to:
1. Estimate the observed wind magnitude from vector components:
$$V_{\text{obs}} = \sqrt{u^2 + v^2}$$ 2. Diagnose the expected wind magnitude (V_gradient) from the full gradient wind equation, which includes pressure gradient force, Coriolis force, and centrifugal force.
3. Compare them using RMSE and bias over time.
Constants:
$\Omega = 7.292\,\times\,10^{-5}\,\text{rad/s}:\text{Earth's rotation Rrate}$
$R_d\,= 287.058\,\text{J/Kg/K}:\text{Gas constant for dry air}$
$R_{earth}\,= 6.371\,\times\,10^{6}\text{m}:\,\text{Earth's radius}$
MATHEMATICAL INTERPRETATION --
Observed wind speed: $V_{\text{obs}} = \sqrt{u^2 + v^2}$
Coriolis Parameter: $f = 2 \Omega \sin(\phi)$, where $\phi$ latitude. A minimum threshold (numerical safeguard) is set:
$$f = \begin{cases} f & \text{if } |f| > 10^{-10} \\ 10^{-10} & \text{otherwise} \end{cases} $$Air density for the ideal gas law: $\rho = \frac{P}{R_d T}$
where $T$ is in Kelvin and $P$ in Pascals. A fallback value of 1.225 Kg/$\text{m}^3$ is applied where needed.
Pressure Gradient Force Term in Wind Direction:
$$ \text{PGF} = \frac{1}{\rho} \left( \frac{\partial P}{\partial x} \cdot \left(-\frac{v}{V_{\text{obs}}}\right) + \frac{\partial P}{\partial y} \cdot \left(\frac{u}{V_{\text{obs}}}\right) \right) $$Advective Acceleration Terms:
$$a_x = u \frac{\partial u}{\partial x} + v \frac{\partial u}{\partial y}$$$$a_y = u \frac{\partial v}{\partial x} + v \frac{\partial v}{\partial y}$$a_x
Curvature term (Normal Acceleration):
$$\text{Normal Acceleration} = - \left( a_x \cdot \left( -\frac{v}{V_{\text{obs}}} \right) + a_y \cdot \left( \frac{u}{V_{\text{obs}}} \right) \right)$$Gradient Wind Equation: $\frac{V^2}{R} + fV = \frac{1}{\rho} |\nabla P|$ written as a quadratic in $V$:
$a V^2 + b V + c = 0$, where: $a = \frac{1}{R}, \quad b = f, \quad c = -\text{PGF}$
Quadratic Formula for Gradient Wind Speed: $V = \frac{-f \pm \sqrt{f^2 - 4ac}}{2a}$. Retaining only real and positive roots --
$$V_{\text{gradient}} = \begin{cases} \min(\text{positive roots}) & \text{if } R > 0 \\ \max(\text{positive roots}) & \text{if } R < 0 \\ \text{positive root} & \text{if } R = 0 \\ \text{NaN} & \text{if } \Delta < 0 \end{cases} $$RMSE:
$$\text{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( V_{\text{obs},i} - V_{\text{gradient},i} \right)^2 }$$a_x
Bias:
$$\text{Bias} = \frac{1}{N} \sum_{i=1}^{N} \left( V_{\text{obs},i} - V_{\text{gradient},i} \right)$$SUMMARIZING --
This code performs a diagnostic validation of the gradient wind balance across time using reanalysis data:
1. It compares the physically diagnosed wind (from force balances) with actual observed winds.
2. It accounts for pressure gradient force, Coriolis effect, and centrifugal acceleration (via curvature).
3. The discrepancy is quantified via RMSE and bias over multiple time steps.
Analytical Introduction for Gradient Wind Equation Verification¶
The following script performs a physical diagnostic analysis to compare observed 10-meter wind speeds with gradient wind speeds derived from fundamental atmospheric dynamics. The gradient wind balance is an extension of geostrophic balance that accounts for curved flow and centrifugal forces, often used to approximate wind speeds around cyclones or curved flow features.
Key Concepts and Steps:
Input Data:
The analysis uses gridded atmospheric variables from a reanalysis or model dataset, including:
Zonal and meridional 10-meter wind components (u10 and v10)
Mean sea level pressure (msl)
2-meter temperature (t2m)
Latitude and longitude coordinates
Unit Conversion and Preprocessing:
Pressure is converted from hPa to Pa if needed, and temperature from Celsius to Kelvin, to ensure correct physical units for calculations.
Observed Wind Speed Calculation:
The observed wind speed magnitude V_obs is computed from the zonal and meridional components.
Coriolis Parameter (f):
Calculated based on latitude, it represents the effect of Earth's rotation on moving air parcels.
Air Density (ρ):
Estimated using the ideal gas law from pressure and temperature, providing the air mass per unit volume needed for force calculations.
Spatial Grid Distances (dx, dy):
Horizontal distances between grid points are calculated in meters, accounting for Earth's curvature, to correctly compute spatial derivatives of the fields.
Smoothing:
Gaussian filtering smooths the wind and pressure fields, reducing noise before gradient computations.
Gradient Calculations:
Spatial derivatives of pressure and wind components are calculated using finite differences, scaled by physical distances.
Pressure Gradient Force Term:
The acceleration due to pressure gradient force projected onto the tangential direction of the flow is computed.
Curvature and Gradient Wind Terms:
The curvature radius of flow is estimated from velocity gradients.
Using the curvature radius, Coriolis parameter, and pressure gradient force, a quadratic equation for the gradient wind speed magnitude is formulated.
Gradient Wind Speed Solution:
The quadratic equation yields two possible gradient wind speeds at each grid point. Physically consistent positive roots are selected based on the sign of curvature.
Error Metrics:
Differences between observed and gradient wind speeds are computed over valid grid points, then summarized as Root Mean Square Error (RMSE) and bias over each time step. A lower RMSE means smaller errors, implying the model's predictions are more accurate. RMSE is sensitive to "outliers", meaning a single very large error can disproportionately increase the RMSE value
Results Visualization:
RMSE and bias are plotted over time to assess the accuracy and systematic differences of the gradient wind approximation relative to observed winds.
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter
from tqdm import tqdm # for progress bar
ds = xr.open_dataset("data_stream-oper_stepType-instant.nc")
# --- Constants ---
OMEGA = 7.292e-5
Rd = 287.058
KELVIN_OFFSET = 273.15
R_earth = 6371000
all_rmse = []
all_bias = []
all_valid_times = []
plt.ioff()
print("\nStarting analysis across all time steps...")
for t_idx in tqdm(range(len(ds['valid_time'])), desc="Processing time steps"):
try:
ds_slice = ds.isel(valid_time=t_idx).squeeze()
# Extract variables
U = ds_slice['u10']
V = ds_slice['v10']
P = ds_slice['msl']
T = ds_slice['t2m']
LAT = ds_slice['latitude']
LON = ds_slice['longitude']
# Convert units
if P.max() < 120000 and P.max() > 900:
P = P * 100
if T.max() < 100:
T = T + KELVIN_OFFSET
V_obs = np.sqrt(U**2 + V**2)
f = 2 * OMEGA * np.sin(np.deg2rad(LAT))
f = f.where(np.abs(f) > 1e-10, other=1e-10)
rho = P / (Rd * T)
rho = rho.where(rho > 0, other=1.225)
# Scalar dx, dy for gradient
delta_lon_deg = np.diff(LON)[0].item()
delta_lat_deg = np.diff(LAT)[0].item()
mean_lat_rad = np.deg2rad(LAT.mean().item())
dx_m = R_earth * np.cos(mean_lat_rad) * np.deg2rad(delta_lon_deg)
dy_m = R_earth * np.deg2rad(delta_lat_deg)
sigma_filter = 1.0
U_s = gaussian_filter(U.values, sigma=sigma_filter)
V_s = gaussian_filter(V.values, sigma=sigma_filter)
P_s = gaussian_filter(P.values, sigma=sigma_filter)
dP_dy, dP_dx = np.gradient(P_s, dy_m, dx_m, axis=(0, 1))
dU_dy, dU_dx = np.gradient(U_s, dy_m, dx_m, axis=(0, 1))
dV_dy, dV_dx = np.gradient(V_s, dy_m, dx_m, axis=(0, 1))
dP_dx = xr.DataArray(dP_dx, coords=P.coords, dims=P.dims)
dP_dy = xr.DataArray(dP_dy, coords=P.coords, dims=P.dims)
V_obs_safe = V_obs.where(V_obs > 1e-6, other=1e-6)
Pressure_Force_Term = (1 / rho) * (dP_dx * (-V / V_obs_safe) + dP_dy * (U / V_obs_safe))
ax = U * xr.DataArray(dU_dx, coords=U.coords, dims=U.dims) + V * xr.DataArray(dU_dy, coords=U.coords, dims=U.dims)
ay = U * xr.DataArray(dV_dx, coords=V.coords, dims=V.dims) + V * xr.DataArray(dV_dy, coords=V.coords, dims=V.dims)
Gradient_Term_V2_R = -(ax * (-V / V_obs_safe) + ay * (U / V_obs_safe))
R_curvature = V_obs_safe**2 / Gradient_Term_V2_R.where(np.abs(Gradient_Term_V2_R) > 1e-6, other=np.nan)
R_curvature_safe = R_curvature.where(
(~np.isnan(R_curvature)) & (~np.isinf(R_curvature)) & (np.abs(R_curvature) > 1e-1),
other=np.nan
)
a = 1 / R_curvature_safe
b = f
c = -Pressure_Force_Term
discriminant = b**2 - 4 * a * c
discriminant = discriminant.where(discriminant >= 0, other=np.nan)
V_gradient = xr.full_like(V_obs, np.nan)
for i in range(V_gradient.shape[0]):
for j in range(V_gradient.shape[1]):
r_val = R_curvature_safe[i, j].item()
f_val = f[i].item()
p_force = Pressure_Force_Term[i, j].item()
if np.isnan(r_val) or np.isnan(f_val) or np.isnan(p_force):
continue
a_val = 1 / r_val
b_val = f_val
c_val = -p_force
disc_val = b_val**2 - 4 * a_val * c_val
if disc_val < 0:
continue
v_plus = (-b_val + np.sqrt(disc_val)) / (2 * a_val)
v_minus = (-b_val - np.sqrt(disc_val)) / (2 * a_val)
positive_roots = [v for v in [v_plus, v_minus] if v > 0]
if r_val > 0 and positive_roots:
V_gradient[i, j] = min(positive_roots)
elif r_val < 0 and positive_roots:
V_gradient[i, j] = max(positive_roots)
elif r_val == 0 and positive_roots:
V_gradient[i, j] = positive_roots[0]
mask = (V_obs > 0.5) & (~np.isnan(V_gradient))
diff = (V_obs - V_gradient).where(mask)
if diff.count() > 0:
current_rmse = np.sqrt((diff**2).mean()).item()
current_bias = diff.mean().item()
all_rmse.append(current_rmse)
all_bias.append(current_bias)
all_valid_times.append(ds_slice.valid_time.item())
except Exception as e:
print(f"Error processing time step {t_idx}: {e}")
# --- Final Summary and Plotting ---
if all_valid_times:
fig1, ax1 = plt.subplots(figsize=(10, 5))
ax1.plot(all_valid_times, all_rmse, marker='o')
ax1.set_title("RMSE over Time")
ax1.set_ylabel("RMSE (m/s)")
ax1.set_xlabel("Time")
ax1.grid(True)
plt.tight_layout()
plt.savefig("rmse_time_series.png")
plt.close()
fig2, ax2 = plt.subplots(figsize=(10, 5))
ax2.plot(all_valid_times, all_bias, marker='o', color='red')
ax2.set_title("Bias (V_obs - V_gradient) over Time")
ax2.set_ylabel("Bias (m/s)")
ax2.set_xlabel("Time")
ax2.axhline(0, linestyle='--', color='gray')
ax2.grid(True)
plt.tight_layout()
plt.savefig("bias_time_series.png")
plt.close()
print("Plots saved: rmse_time_series.png, bias_time_series.png")
else:
print("No valid time steps were successfully processed. Check data availability and variable names.")
Starting analysis across all time steps...
Processing time steps: 100%|██████████| 1281/1281 [06:38<00:00, 3.21it/s]
Plots saved: rmse_time_series.png, bias_time_series.png
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
img1 = mpimg.imread('rmse_time_series.png')
plt.imshow(img1)
plt.axis('off') # Hide axes
plt.show()
img2 = mpimg.imread('bias_time_series.png')
plt.imshow(img2)
plt.axis('off') # Hide axes
plt.show()
RMSE over Time (rmse_time_series.png):¶
Definition of RMSE: Root Mean Square Error (RMSE) is a measure of the magnitude of the errors between predicted values (in this case, likely V_gradient) and observed values (V_obs). It's always non-negative, and a lower RMSE indicates a better fit of the model to the data.
Overall Trend: The RMSE values generally fluctuate between 0 and around 700 m/s for most of the time.
Spikes/Outliers: Similar to the bias plot, there are prominent upward spikes in the RMSE, with the largest one exceeding 2500 m/s. These spikes correspond to periods of large errors between the observed and gradient-derived velocities.
Relationship to Bias Spikes: It's highly probable that the spikes in the RMSE plot correspond to the spikes in the negative bias plot. When the bias is extremely negative (meaning a large difference between V_obs and V_gradient), the magnitude of the error (RMSE) will naturally be high. This suggests that the same underlying events or issues are causing both the large negative biases and the large errors
Bias (V_obs - V_gradient) over Time (bias_time_series.png):¶
Definition of Bias: Bias here represents the difference between an observed velocity (V_obs) and a gradient-derived velocity (V_gradient). A negative bias means that the observed velocity is generally lower than the gradient-derived velocity, or the gradient-derived velocity is generally higher than the observed velocity. A positive bias would mean the opposite.
Overall Trend: The plot clearly shows a consistent negative bias over the entire time period. The values hover mostly between 0 and -500 m/s, indicating that V_obs is consistently smaller than V_gradient.
Spikes/Outliers: There are several significant downward spikes where the bias becomes much more negative (e.g., reaching -1500 m/s to over -2000 m/s). These spikes indicate periods where the V_gradient significantly overestimates V_obs, or V_obs is significantly lower than expected by the V_gradient. These could be due to:
Sudden changes in the observed system that the gradient model doesn't capture well.
Measurement errors in V_obs.
Issues with the V_gradient calculation during those specific times.
Near-Zero Line: The dashed grey line at 0 m/s represents the ideal scenario of no bias. The data consistently stays below this line.
Summary and Potential Interpretations:¶
Systematic Underestimation/Overestimation: There is a systematic negative bias, meaning V_gradient consistently estimates higher velocities than V_obs (or V_obs consistently shows lower velocities than V_gradient). This suggests a fundamental discrepancy or calibration issue in the system, or that the V_gradient model is inherently biased for this particular phenomenon.
Episodic Large Errors: Both plots highlight specific time points where the discrepancy between V_obs and V_gradient becomes exceptionally large, leading to both very high negative bias and very high RMSE. These events warrant further investigation.
I. Possible Causes for Spikes: These could be related to:
Physical Events: Sudden and intense physical phenomena (e.g., strong wind gusts, seismic activity, rapid changes in the medium being measured) that the V_gradient model is not equipped to handle.
Sensor Malfunctions: Temporary issues with the V_obs sensor readings. However, ERA5 is highly credible and respected sourxe of meteorological and climatology data.
Model Limitations: The V_gradient model might have limitations under certain extreme conditions.
Data Quality Issues: Outliers or corrupted data points in either V_obs or V_gradient.
Model Performance: While there's a systematic bias, the general RMSE outside of the spikes appears to be relatively consistent, suggesting that the model performs somewhat predictably under "normal" conditions, but struggles significantly during specific events.
Time-Resolved Comparison Between Observed Surface Wind Speed (V_obs) and a Physically Diagnosed Wind Speed¶
Haversine Distance from Cyclone Center. Let:
$$\phi\,,\lambda: \text{grid latitude and longitude (in radians)}$$$$\phi_{c}\,,\lambda_{c}: \text{cyclone centre latitude and longitude (in radians)}$$Then:
$$a = \sin^2\left(\frac{\phi - \phi_c}{2}\right) + \cos(\phi_c)\cos(\phi)\sin^2\left(\frac{\lambda - \lambda_c}{2}\right) $$c = 2 \arcsin\left( \sqrt{a} \right)$$
$$r = R_{\text{earth}} \cdot c$$Grid Distance Conversion (given grid spacings in degrees):
$$\Delta y = R_{\text{earth}} \cdot \Delta \phi, \quad \Delta x = R_{\text{earth}} \cdot \cos(\phi_c) \cdot \Delta \lambda $$Pressure Gradient in Radial Direction --
using gradients:
$$\frac{\partial P}{\partial y}, \quad \frac{\partial P}{\partial x}$$Compute unit radial vectors:
$$\hat{r}_x = \frac{\Delta \lambda}{\sqrt{(\Delta \phi)^2 + (\Delta \lambda)^2 + \epsilon}}, \quad \hat{r}_y = \frac{\Delta \phi}{\sqrt{(\Delta \phi)^2 + (\Delta \lambda)^2 + \epsilon}}$$Then radial pressure gradient:
$$\frac{\partial P}{\partial r} = \frac{\partial P}{\partial x} \cdot \hat{r}_x + \frac{\partial P}{\partial y} \cdot \hat{r}_y$$Cyclostrophic Wind Equation --Assuming balance between centrifugal and pressure gradient forces:
$$\frac{V^2}{r} = \frac{1}{\rho} \cdot \frac{\partial P}{\partial r}$$Solving for wind speed:
$$V_{\text{cyclo}} = \sqrt{ \frac{r}{\rho} \cdot \frac{\partial P}{\partial r} }$$Only (real) positive values are considered:
$$V_{\text{cyclo}} = \begin{cases} \sqrt{ \frac{r}{\rho} \cdot \frac{\partial P}{\partial r} }, & \text{if } \frac{r}{\rho} \cdot \frac{\partial P}{\partial r} \geq 0 \\ \text{NaN}, & \text{otherwise} \end{cases} $$Bias and RMSE in Cyclostrophic Wind Analysis¶
After calculating the theoretical cyclostrophic wind speed from pressure and temperature fields, we want to evaluate how closely it matches the actual observed wind.
Let:
$V_obs$ : observed wind speed (from u10 and v10)
$V_cyclo$ : cyclostrophic wind speed (from analytical balance)
Bias:
$$\text{Bias} = \frac{1}{N}\sum_{i=1}^{N} \left( V_{\text{obs}, i} - V_{\text{cyclo}, i} \right)$$represents systematic under- or over-estimation of wind by the cyclostrophic assumption.
RMSE:
$$\text{RMSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( V_{\text{obs}, i} - V_{\text{cyclo}, i} \right)^2 }$$measures the overall magnitude of the error, including random noise and bias.
Metrics Bias and RMSE are computed over all valid (non-masked, finite, positive wind) grid points for each time step, then tracked over time to visualize model performance.
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
from scipy.ndimage import gaussian_filter
import warnings
# --- Suppress sqrt warnings ---
warnings.filterwarnings("ignore", message="invalid value encountered in sqrt")
# --- Constants ---
Rd = 287.058
KELVIN_OFFSET = 273.15
R_earth = 6371000
# --- Cyclone center ---
center_lat = 20.0
center_lon = -70.0
# --- Storage ---
all_bias = []
all_rmse = []
all_times = []
# --- Begin Loop ---
for t_idx in range(len(ds['valid_time'])):
try:
ds_slice = ds.isel(valid_time=t_idx).squeeze()
# --- Variables ---
P = ds_slice['msl']
T = ds_slice['t2m']
U = ds_slice['u10']
V = ds_slice['v10']
LAT = ds_slice['latitude']
LON = ds_slice['longitude']
# --- Unit Conversion ---
if P.max() < 120000: P *= 100
if T.max() < 100: T += KELVIN_OFFSET
# --- Observed Wind ---
V_obs = np.sqrt(U**2 + V**2).where(lambda x: x > 0.5)
# --- Air Density ---
rho = P / (Rd * T)
# --- Smooth Pressure ---
P_sm = gaussian_filter(P.values, sigma=1.0)
# --- Radius from Center ---
lon2d, lat2d = np.meshgrid(LON.values, LAT.values)
lon_rad = np.deg2rad(lon2d)
lat_rad = np.deg2rad(lat2d)
clat_rad = np.deg2rad(center_lat)
clon_rad = np.deg2rad(center_lon)
dlat = lat_rad - clat_rad
dlon = lon_rad - clon_rad
a = np.sin(dlat/2)**2 + np.cos(clat_rad)*np.cos(lat_rad)*np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
r = R_earth * c
# --- Pressure Gradient ---
dlat_deg = np.diff(LAT)[0].item()
dlon_deg = np.diff(LON)[0].item()
dy = R_earth * np.deg2rad(dlat_deg)
dx = R_earth * np.cos(clat_rad) * np.deg2rad(dlon_deg)
dP_dy, dP_dx = np.gradient(P_sm, dy, dx)
unit_radial_x = dlon / np.sqrt(dlat**2 + dlon**2 + 1e-10)
unit_radial_y = dlat / np.sqrt(dlat**2 + dlon**2 + 1e-10)
dP_dr = dP_dx * unit_radial_x + dP_dy * unit_radial_y
# --- Cyclostrophic Wind ---
dP_dr_da = xr.DataArray(dP_dr, coords=P.coords, dims=P.dims)
r_da = xr.DataArray(r, coords=P.coords, dims=P.dims)
term = (r_da / rho) * dP_dr_da
term = term.where(term >= 0)
V_cyclo = np.sqrt(term)
# --- Error Metrics ---
diff = (V_obs - V_cyclo).where(~np.isnan(V_obs) & ~np.isnan(V_cyclo))
if diff.count() > 0:
rmse = np.sqrt((diff**2).mean()).item()
bias = diff.mean().item()
all_rmse.append(rmse)
all_bias.append(bias)
all_times.append(ds_slice.valid_time.item())
except Exception:
continue
# --- Plot Results ---
if all_times:
fig, ax = plt.subplots(1, 2, figsize=(14, 5))
ax[0].plot(all_times, all_rmse, marker='o')
ax[0].set_title("Cyclostrophic Wind RMSE (Observed vs Computed)")
ax[0].set_ylabel("RMSE (m/s)")
ax[0].set_xlabel("Time")
ax[0].grid(True)
ax[1].plot(all_times, all_bias, marker='o', color='red')
ax[1].set_title("Cyclostrophic Wind Bias (Observed - Computed)")
ax[1].set_ylabel("Bias (m/s)")
ax[1].set_xlabel("Time")
ax[1].axhline(0, color='gray', linestyle='--')
ax[1].grid(True)
plt.tight_layout()
plt.savefig("cyclostrophic_rmse_bias.png")
plt.close()
print("✅ Plots saved: cyclostrophic_rmse_bias.png")
else:
print("❌ No valid results to plot.")
✅ Plots saved: cyclostrophic_rmse_bias.png
cyclo_img = mpimg.imread('cyclostrophic_rmse_bias.png')
plt.imshow(cyclo_img)
plt.axis('off') # Hide axes
plt.show()
Joint Plot V-obs vs V_cyclo¶
A joint plot (also called a scatter or density plot) of V_obs vs V_cyclo helps to visually assess how well the cyclostrophic wind matches observed wind speeds.
import pandas as pd
import seaborn as sns
# Collect all valid V_obs and V_cyclo pairs across all time steps
all_v_obs = []
all_v_cyclo = []
for t_idx in range(len(ds['valid_time'])):
try:
ds_slice = ds.isel(valid_time=t_idx).squeeze()
# Fields
P = ds_slice['msl']
T = ds_slice['t2m']
U = ds_slice['u10']
V = ds_slice['v10']
LAT = ds_slice['latitude']
LON = ds_slice['longitude']
if P.max() < 120000:
P = P * 100
if T.max() < 100:
T = T + KELVIN_OFFSET
V_obs = np.sqrt(U**2 + V**2)
rho = P / (Rd * T)
P_sm = gaussian_filter(P.values, sigma=1.0)
lon2d, lat2d = np.meshgrid(LON.values, LAT.values)
lon_rad = np.deg2rad(lon2d)
lat_rad = np.deg2rad(lat2d)
clat_rad = np.deg2rad(center_lat)
clon_rad = np.deg2rad(center_lon)
dlat = lat_rad - clat_rad
dlon = lon_rad - clon_rad
a = np.sin(dlat/2)**2 + np.cos(clat_rad)*np.cos(lat_rad)*np.sin(dlon/2)**2
c = 2 * np.arcsin(np.sqrt(a))
r = R_earth * c
dlat_deg = np.diff(LAT)[0].item()
dlon_deg = np.diff(LON)[0].item()
dy = R_earth * np.deg2rad(dlat_deg)
dx = R_earth * np.cos(clat_rad) * np.deg2rad(dlon_deg)
dP_dy, dP_dx = np.gradient(P_sm, dy, dx)
unit_radial_x = dlon / np.sqrt(dlat**2 + dlon**2 + 1e-10)
unit_radial_y = dlat / np.sqrt(dlat**2 + dlon**2 + 1e-10)
dP_dr = dP_dx * unit_radial_x + dP_dy * unit_radial_y
dP_dr_da = xr.DataArray(dP_dr, coords=P.coords, dims=P.dims)
r_da = xr.DataArray(r, coords=P.coords, dims=P.dims)
V_cyclo = np.sqrt((r_da / rho) * dP_dr_da)
V_cyclo = V_cyclo.where(V_cyclo >= 0)
# Flatten and filter
v_obs_flat = V_obs.values.flatten()
v_cyclo_flat = V_cyclo.values.flatten()
valid = (~np.isnan(v_obs_flat)) & (~np.isnan(v_cyclo_flat)) & (v_obs_flat > 0.5)
all_v_obs.extend(v_obs_flat[valid])
all_v_cyclo.extend(v_cyclo_flat[valid])
except Exception as e:
print(f"Skip step {t_idx}: {e}")
# Create DataFrame
df = pd.DataFrame({
"V_obs": all_v_obs,
"V_cyclo": all_v_cyclo
})
# Plot: Seaborn jointplot
sns.set(style="whitegrid")
g = sns.jointplot(data=df, x="V_obs", y="V_cyclo", kind="hex", color="teal", height=8)
g.ax_joint.plot([0, max(df["V_obs"].max(), df["V_cyclo"].max())],
[0, max(df["V_obs"].max(), df["V_cyclo"].max())],
linestyle="--", color="black")
g.ax_joint.set_xlabel("Observed Wind Speed (m/s)")
g.ax_joint.set_ylabel("Cyclostrophic Wind Speed (m/s)")
plt.suptitle("Joint Distribution: V_obs vs V_cyclo", y=1.02)
plt.tight_layout()
plt.savefig("jointplot_vobs_vs_vcyclo.png")
plt.close()
print("📊 Saved: jointplot_vobs_vs_vcyclo.png")
📊 Saved: jointplot_vobs_vs_vcyclo.png
joint_cyclo_img = mpimg.imread('jointplot_vobs_vs_vcyclo.png')
plt.imshow(joint_cyclo_img)
plt.axis('off') # Hide axes
plt.show()
A joint plot is a powerful diagnostic tool that visually compares two related variables—in this case:
V_obs: the observed 10-meter wind speed, calculated as
$ V_{\text{obs}}$ : the observed 10-meter wind speed, calculated as
$$V_{\text{obs}} = \sqrt{u_{10}^2 + v_{10}^2}$$$V_{\text{cyclo}}$ : the theoretical cyclostrophic wind speed, computed from the balance of centrifugal and pressure gradient forces:
$$V_{\text{cyclo}} = \sqrt{ \frac{r}{\rho} \frac{\partial p}{\partial r} }$$The joint plot provides a visual check on how well the cyclostrophic wind approximation matches reality:
Are values clustered near the 1:1 line?
Are there biases (e.g., over/underestimation)?
Is the spread wide or tight (random error)?
Scatter or Hexbin Plot (Center Panel)
This shows how each observed wind speed matches its corresponding cyclostrophic estimate:
Each point = one grid cell at one time step
X-axis = V_obs (real wind from ERA5)
Y-axis = V_cyclo (diagnosed wind from pressure field)
Diagonal line = perfect match: V_obs = V_cyclo
Below the line = model underestimates wind
Above the line = model overestimates wind
Marginal Histograms (Top and Right Panels)
These show the distribution of:
V_obs values (top)
V_cyclo values (right)
This helps identify:
Outliers
Skewness
Truncation (e.g., if V_cyclo is zero in many grid points)
Good agreement: if points lie along or close to the diagonal
Bias: if points systematically fall above or below the line
Nonlinearity: e.g., agreement at low speeds but divergence at high speeds
Dispersion: vertical spread at each V_obs value shows noise or model limitations
A More General Description Involing Wind Speed and Pressure - Horizontal Momentum Equation (in Rotating Frame)¶
In a rotating reference frame (like Earth), the Navier-Stokes equations in vector form (neglecting vertical motion and curvature terms) reduce to the primitive horizontal momentum equation:
$$\frac{D\vec{V}}{Dt} = -\frac{1}{\rho} \nabla p - f \hat{k} \times \vec{V} + \vec{F}_r$$$f$ being the coriolis parameter $f = 2\Omega\text{sin}\phi$
$\hat{k}$ being the unit vector in vertical (z) direction
$\vec{F}_r$ being the frictional force vector (turbulence/stress drag)
Coriolis Force: $-f\hat{k}\,\times\,\,\vec{V}$
Frictional Force: $\vec{F}_r = \nu \nabla^2 \vec{V}$
Summary of the Full Horizontal Momentum Equation (FHME):
This governs the horizontal wind behaviour in the lower atmosphere and explains the physical basis of what you're plotting:
Pressure gradients → initiate wind
Coriolis force → deflects wind (causing rotation)
Friction → slows it down near surface
NOTE: for future interest, FHME can be applied to RMSE and Bias analysis when contrasted with observed data.
Multinomial Logistic Model for Storm Events¶
Reverting back to Montserrat based hourly data concerning extreme events. Multinomial logistic regression is an extension of binomial logistic regression applied for predicting categorical outcomes exceeding more than two classes.
Model Description¶
For $K$ number of classes in the categorical response variable $Y$, which can take on values $y \in\,\,{1,2,...,K}$;
$X$ being a vector of predictors, say, $X = [X_1,X_2,...,X_n]$, where $n$ is the number of predictors.
Probability Model¶
The multinomial logistic regression model estimates the probability of each class $k$ given the predictors $X$:
$$P(Y = k|X) = \frac{e^{\beta_{0k} + \beta_{1k}}X_1 + \beta_{2k}X_2 + \ldots + \beta_{pk}X_p}{\sum_{j=1}^{k} e^{\beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \ldots + \beta_{pj}X_p}}$$where:
$P(Y = k|X)$ is the probability that the dependent variable $Y$ is equal to class $k$ given the predictor variables $X$;
$\beta_{0k}$ being the intercept for class $k$;
$\beta_{ik}$ being the coefficient for predctor $X_i$ for class $k$.
Reference Class¶
Customarily, one class is chosen as the reference class (typically class 1), and the probabilities for other classes are modeled relative to this reference class. Namely:
$$P(Y = k|X) = \frac{e^{\beta_{0k} + \beta_{1k}}X_1 + \beta_{2k}X_2 + \ldots + \beta_{pk}X_p}{1+ \sum_{j=2}^{k} e^{\beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \ldots + \beta_{pj}X_p}}\,\,\,\text{for}\,\,k = 2,..,K$$where the probability for the reference class (class 1) is:
$$P(Y = k|X) = \frac{1}{\sum_{j=1}^{k} e^{\beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + \ldots + \beta_{pj}X_p}}$$Log Odds Ratios¶
The Log Odds Ratios (logit) for class $k$ relative to the reference class can be expressed as:
$$\text{log}(\frac{Y = k|X}{Y = 1|X}) = e^{\beta_{0k} + \beta_{1k}}X_1 + \beta_{2k}X_2 + \ldots + \beta_{pk}X_p$$Such above exhibits the the log-odds of being in class $k$ relative to reference class can be modeled as a linear combination of the predictors.
Estimation¶
The coefficients $\beta_{ik}$ are estimated using the maximum likelihood estimation (MLE), finding the set of paraneters that maximizes the likelihood of the observed data given the model.
Abstract¶
The multinomial logistic regression model predicts the probabilities of different classes (categories) based on features via softmax (or sigmoid) function that transforms the linear combination of features into probabilities, ensuring that they sum to 1 across all classes.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
multi_logit_data = hourly_dataframe_extreme_clean.copy()
# Prepare features and target
X = multi_logit_data[['temperature_2m', 'rain', 'wind_speed_10m',
'wind_speed_100m', 'wind_gusts_10m', 'pressure_msl']]
y = multi_logit_data['outlier_type'] # Fix 1: use Series
# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
# Fit logistic regression
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=2000, class_weight='balanced') # Fix 2: more iterations
model.fit(X_train, y_train)
# Predictions and report
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred, zero_division=0)
print(report)
precision recall f1-score support
0 0.96 0.61 0.74 5511
3 0.12 0.64 0.20 437
6 0.50 1.00 0.67 1
7 0.38 1.00 0.55 113
accuracy 0.62 6062
macro avg 0.49 0.81 0.54 6062
weighted avg 0.89 0.62 0.70 6062
Class 0: High precision (0.96) → Most predicted class 0s were correct.
Low recall (0.61) → Many actual class 0s were misclassified.
This suggests the model is underpredicting class 0 or confusing it with minority classes.
Class 3: Very low precision (0.12) → Most predicted class 3s were actually another class.
High recall (0.64) → Many actual class 3s were found, but at the cost of high false positives.
Suggests class confusion, possibly because of feature overlap.
Class 6: Perfect recall/precision (1.0/0.50) but only 1 sample – not statistically meaningful.
Class 7: Moderate precision (0.38) and perfect recall (1.00) – model finds all class 7 cases but includes many false positives.
Accuracy: 62% — misleading due to class imbalance.
Macro average F1: 0.54 — shows performance is poor on minority classes.
Weighted avg F1: 0.70 — dominated by the majority class (0).
Logistic regression might be too rigid for capturing complex relationships.
Alternatives:
RandomForestClassifier (robust, handles imbalance better)
XGBoost or LightGBM
GradientBoostingClassifier
Survival Analysis with (Hourly) Weather Data For Extreme Events¶
Survival analysis, a statistical methodology traditionally employed in fields like medicine and engineering, has found increasing application in the realm of meteorology. By treating weather events as "survival" times, researchers can gain valuable insights into their duration, frequency, and underlying factors.
One of the key challenges in applying survival analysis to weather data is the presence of censored events. Weather events often do not have a definitive endpoint, especially when the data collection period ends before the event concludes. This necessitates the use of survival analysis techniques that can handle censored observations, such as the Kaplan-Meier estimator.
Furthermore, weather patterns are influenced by a multitude of factors, including climate change, El Niño-Southern Oscillation, and local geographic conditions. These factors can be incorporated into survival models as time-varying covariates, providing a more nuanced understanding of the factors driving the duration of weather events.
Spatial dependencies also play a significant role in weather phenomena. Survival models can be extended to account for these dependencies, allowing for a more accurate representation of the spatial distribution of weather events.
Applications of survival analysis in weather data are diverse. For instance, researchers can use it to quantify the duration and frequency of extreme events like hurricanes, floods, and wildfires. This information can be invaluable for disaster management and risk assessment. Additionally, survival analysis can be employed to evaluate the impact of climate change on the occurrence and characteristics of weather events, aiding in climate adaptation planning.
The applied data set concerns hourly meteorological data focused on the New York City area.
survival_data = hourly_dataframe_extreme_clean[['date', 'temperature_2m', 'rain',
'wind_speed_10m','wind_speed_100m',
'wind_gusts_10m',
'pressure_msl', 'outlier_type']].copy()
survival_data['year'] = survival_data['date'].dt.year
survival_data['month'] = survival_data['date'].dt.month
survival_data['day'] = survival_data['date'].dt.day
survival_data['hour'] = survival_data['date'].dt.hour
# SRecreate timestamp from parts (ensures consistent precision)
survival_data['timestamp'] = pd.to_datetime(survival_data[['year', 'month', 'day', 'hour']])
survival_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 30309 entries, 0 to 30308 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 30309 non-null datetime64[ns, UTC] 1 temperature_2m 30309 non-null float32 2 rain 30309 non-null float32 3 wind_speed_10m 30309 non-null float32 4 wind_speed_100m 30309 non-null float32 5 wind_gusts_10m 30309 non-null float32 6 pressure_msl 30309 non-null float32 7 outlier_type 30309 non-null int64 8 year 30309 non-null int32 9 month 30309 non-null int32 10 day 30309 non-null int32 11 hour 30309 non-null int32 12 timestamp 30309 non-null datetime64[ns] dtypes: datetime64[ns, UTC](1), datetime64[ns](1), float32(6), int32(4), int64(1) memory usage: 2.1 MB
The Kaplan-Meier Estimator: A Tool for Weather Data Analysis¶
The Kaplan-Meier estimator (KME), a cornerstone of survival analysis (Stalpers and Kaplan 2018), has found applications beyond its traditional medical and engineering domains. In the field of meteorology, it can be employed to analyze the duration of weather events, such as heatwaves, cold spells, or droughts.
For event times $t_i$ being the times when an event (flood, rainfall, death, etc.) occurs. The KME estimator focuses only on such distinct event times, ignoring intervals where no events occur.The number of events $d_i$ being the count of events (flood, rainfall, death, etc.) occurring at each event time $t_i$.
The number at risk $n_i$ is the number of individuals (or areas) not yet experiencing the event or censored right before time $t_i$. It represents the group of individuals or areas who are at risk of experiencing the event at that specific time.
The KNME survival function is calculated as the product of survival probabilities over time:
$$\hat{S}(t) = \left(1 - \frac{n_1}{d_1}\right) \times \left(1 - \frac{n_2}{d_2}\right) \times \ldots \times \left(1 - \frac{n_k}{d_k}\right)$$The above estimator assumes that:
- Censoring is independent of survival times.
- Survival probabilities are constant between event times.
- The risk set is updated accurately to reflect censored observations.The Kaplan-Meier estimator provides a step function that estimates the probability of survival beyond a certain time. It accounts for censored data (subjects who are lost to follow-up or whose event time extends beyond the study period) by adjusting the risk set accordingly at each step.
$S(t)$ being the probability that life is longer than $t$, the general function:
$$\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)$$The survival curve will step down at each $t_i$ where the event ends; exhibiting the probability of an event continuing past a certain point. So, for $S(3)= 85$ , such conveys that 85% of particular event type lasts more than 3 days (months, etc.). The curve drops as more of the event type ends, conveying the likelihood of the event type persisting beyond each time point.
Weather events often exhibit characteristics that align well with the concepts of survival analysis. For instance, the duration of a heatwave can be considered a "survival time," and the event might be censored if it is still ongoing when the data collection period ends.
By applying the Kaplan-Meier estimator to weather data, researchers can:
Estimate the duration of weather events: Quantify the average length of heatwaves, cold spells, or other extreme events.
Compare the duration of events across different regions or time periods: Identify trends and variations in the persistence of weather phenomena.
Assess the impact of climate change: Examine how the duration of weather events has changed over time and whether there are discernible trends related to climate change.
Inform decision-making: Provide valuable insights for policymakers, emergency managers, and public health officials in planning and response to weather-related events.
The Kaplan-Meier estimator's ability to handle censored data is particularly valuable in weather analysis, as many events may not have a definitive endpoint within the study period. Additionally, the estimator can be used to create survival curves, which visually represent the probability of a weather event continuing beyond a certain duration.
The Weibull Parametric Model in Survival Analysis for Weather¶
The Weibull parametric model has emerged as a powerful tool in the field of survival analysis, particularly for analyzing weather-related phenomena. Its versatility in modeling various distributions, including exponential, Rayleigh, and extreme value, makes it a valuable choice for understanding the time to occurrence of weather events such as storms, droughts, or heatwaves.
The Weibull distribution is characterized by two parameters: the shape parameter $(k)$ and the scale parameter $(\lambda)$. The shape parameter determines the overall shape of the distribution, while the scale parameter influences the location of the distribution along the time axis. When $k = 1$, the Weibull distribution reduces to the exponential distribution, which is often used to model the time between events in a Poisson process.
In the context of Weibull parametric model theory (Li, Marcuss and Russell 2024), considering the accelerated failure time (AFT) model:
$$Y=\text {log}(T)=\mu+\alpha\,Z+\sigma\epsilon$$where $T$ is the survival time, $\mu$ is the intercept, $Z$ is an $n$ by $p$ matrix, having $n$ as the number of samples and $p$ as the number of predictors/covariates; $\alpha$ is the coefficient of the predictors. $\epsilon$ is a random error term assumed to follow the extreme value distribution. for the Weibull distribution there is an additional parameter $\sigma$ which scales $\epsilon$. Let
$$\gamma = \frac{1}{\sigma}$$$$\lambda=e^{-\frac{\mu}{\sigma}}$$$$\beta=-\frac{\alpha}{\sigma}$$Then to have the Weibull model with a baseline hazard of:
$$h(x|z)=(\gamma\lambda\,t^{\gamma-1})e^{\beta\,Z}$$where $\gamma$ is the shape parameter, and $\lambda$ is the scale parameter. The hazard ratio (HR) is defined as:
$$HR=e^{\beta}$$The Weibull model has found numerous applications in weather analysis. One such application is in the study of storm duration. By fitting the Weibull model to historical data on storm durations, researchers can estimate the probability of a storm lasting a certain duration. This information can be invaluable for emergency planning and disaster response.
Another important application of the Weibull model is in the analysis of the time between extreme weather events. This can help researchers understand the frequency and intensity of these events, such as droughts or heatwaves. By identifying patterns in the timing of extreme events, researchers can gain insights into the underlying factors driving their occurrence.
In addition to analyzing storm durations and extreme events, the Weibull model can also be used to study the failure time of weather-related equipment. This information can be used to optimize maintenance schedules and ensure the reliability of weather data. For example, by analyzing the failure times of weather sensors, researchers can determine the optimal frequency of inspections and repairs.
Finally, the Weibull model can be used to analyze extreme values of weather variables, such as temperature or precipitation. This can help identify and quantify extreme events that may pose significant risks to society and infrastructure. By understanding the probability of extreme events, researchers can develop strategies for mitigating their impacts.
The Weibull model offers several advantages that make it a valuable tool for weather analysis. One of its key advantages is its flexibility. The Weibull model can accommodate a wide range of distributions, making it suitable for modeling various weather phenomena.
Another advantage of the Weibull model is its ease of interpretation. The parameters of the Weibull model have clear interpretations, making it easier to understand the results of the analysis. This makes the model accessible to researchers and practitioners with varying levels of statistical expertise.
Furthermore, the Weibull model can be used for statistical inference, such as hypothesis testing and confidence interval estimation. This allows researchers to draw conclusions about the underlying population based on the sample data.
While the Weibull model is a powerful tool, it is important to be aware of its limitations and considerations. One limitation of the Weibull model is that it assumes that the hazard rate function is monotonic. If this assumption is violated, the model may not provide accurate results.
Another factor to consider is the quality of the data used in the analysis. The accuracy of the Weibull model depends on the quality of the data. Incomplete or biased data can lead to misleading results.
Finally, it is important to consider the appropriateness of the Weibull model for the specific weather phenomenon being studied. In some cases, other parametric or nonparametric models may be more suitable. It is essential to carefully consider the characteristics of the data and the research objectives when selecting a model.
Survival Analysis during the Hurricane Season¶
The Atlantic hurricane season spans from June 1st to November 30th, also coincides with the rainy season for Montserrat. Now, to develop and observe survival analysis for such season.
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter, WeibullFitter
# Filter to hurricane season months (June to November) and create a fresh copy
survival_data = survival_data[survival_data['month'].between(6, 11)].copy()
# Define season start (June 1st midnight) for each year
survival_data['season_start'] = pd.to_datetime(survival_data['year'].astype(str) + '-06-01 00:00:00')
# Calculate duration in hours from season start
survival_data['duration'] = (survival_data['timestamp'] - survival_data['season_start']).dt.total_seconds() / 3600
# Ensure strictly positive durations (for Weibull model)
survival_data['duration'] = survival_data['duration'].apply(lambda x: x + 1e-6 if x <= 0 else x)
# Define binary event flag (1 = extreme event, 0 = normal)
survival_data['event'] = (survival_data['outlier_type'] > 0).astype(int)
# Assign time column
time_column = survival_data['duration']
# Clear previous figures
plt.clf()
plt.cla()
plt.close('all')
# === Kaplan-Meier Survival Curve ===
kmf = KaplanMeierFitter()
kmf.fit(durations=time_column, event_observed=survival_data['event'])
plt.figure(figsize=(10, 6))
kmf.plot_survival_function()
plt.title("Kaplan-Meier Survival Curve for Extreme Weather Events")
plt.xlabel("Hours Since June 1st")
plt.ylabel("Survival Probability")
plt.grid(True)
plt.show()
# === Weibull Parametric Survival Curve ===
wf = WeibullFitter()
wf.fit(durations=time_column, event_observed=survival_data['event'])
plt.figure(figsize=(10, 6))
wf.plot_survival_function()
plt.title("Weibull Survival Curve for Extreme Weather Events")
plt.xlabel("Hours Since June 1st")
plt.ylabel("Survival Probability")
plt.grid(True)
plt.show()
NOTE: the results above are only representative of the applied data (time range and place).
Interpretation¶
Kaplan-Meier
The Kaplan-Meier survival curve provided offers a visual representation of the probability of surviving (i.e., not experiencing) an extreme weather event over time. This statistical tool is commonly used in survival analysis to assess the likelihood of an event occurring within a specific timeframe.
A key observation from the curve is its general downward slope with downward concavity, indicating a decreasing probability of survival over time. This is expected, as the longer the observation period, the greater the chance of encountering an extreme weather event. Initially, the curve starts at a very high probability (close to 1.000), suggesting a low likelihood of such events at the beginning of the study period. However, as time progresses, the curve gradually slopes downward, indicating an increasing risk of experiencing an extreme weather event.
The shaded area around the curve represents the confidence interval, which indicates the range of possible survival probabilities. A narrower confidence band suggests greater certainty in the estimate, while a wider band indicates more uncertainty. In this case, the relatively narrow confidence bands suggest that the estimates are reasonably reliable.
Based on these observations, we can infer that:
The study period began with a low probability of experiencing an extreme weather event.
Over time, the risk of such events increased.
The uncertainty in the estimates is relatively low.
Weibull
The provided Weibull survival curve offers a visual representation of the probability of surviving (i.e., not experiencing) an extreme weather event over time. This statistical tool is commonly used in survival analysis to model the distribution of failure times, in this case, the occurrence of extreme weather events. The observations:
The curve shows a general downward slope, indicating a decreasing probability of survival over time. This is expected, as the longer the observation period, the greater the chance of encountering an extreme weather event.
The blue line represents the Weibull estimate, which is a parametric model that fits a specific probability distribution to the data. In this case, the Weibull distribution is used to model the time to occurrence of extreme weather events.
The shaded area around the curve represents the confidence interval, which indicates the range of possible survival probabilities. A wider confidence band suggests greater uncertainty in the estimate.
To now apply survival analysis with the Kaplan-Meier model:
Model Output Statistics (MOS) With Random Forest¶
Model Output Statistics (MOS) is a statistical technique used to calibrate the output of a numerical weather prediction (NWP) model. It involves training a statistical model on historical data to relate the raw NWP model output to observed values. This calibration can improve the accuracy and reliability of weather forecasts.
Random forest is a popular machine learning algorithm that can be used for both classification and regression tasks. When applied to weather forecasting, random forest can be used to predict various meteorological variables, such as temperature, precipitation, and wind speed.
To incorporate MOS with a random forest model for weather forecasting, the following steps are generally involved:
- Data Preparation:
Collect historical data for both the NWP model output and observed values. Ensure that the data is aligned in terms of time and location. Consider preprocessing the data, such as handling missing values or outliers.
- Random Forest Training:
Train a random forest model using the historical data. The input features of the model would be the NWP model output variables, and the target variable would be the corresponding observed values.
- MOS Calibration:
Once the random forest model is trained, apply it to the NWP model output to obtain calibrated forecasts. The calibrated forecasts are the output of the random forest model, which have been adjusted based on the historical relationship between the NWP model output and observed values.
Benefits of MOS with Random Forest:
Improved Accuracy: MOS can help to correct systematic biases in the NWP model output, leading to more accurate forecasts.
Enhanced Reliability: MOS can improve the reliability of forecasts, especially for extreme weather events.
Better Calibration: MOS can calibrate the probabilistic output of the NWP model, providing more accurate estimates of uncertainty.
Flexibility: Random forest is a flexible algorithm that can be applied to various weather variables and forecasting tasks.
The MOS Random Forest model aims to predict the corrected weather variable (e.g., precipitation) based on features derived from weather observations (e.g., temperature, humidity, pressure).
NOTE: due to time constraints, lack of resources and the sophistication of NWP models, an actual NWP model with its outputs will not be implemented. Rather, relying solely on a multivariate regression model structured on observed data (training observations) to compare with the unobserved (test set data).
To now acquire data to serve the MOS pursuit....
import openmeteo_requests
import pandas as pd
import requests_cache
from retry_requests import retry
# Setup the Open-Meteo API client with cache and retry on error
cache_session = requests_cache.CachedSession('.cache', expire_after = -1)
retry_session = retry(cache_session, retries = 5, backoff_factor = 0.2)
openmeteo = openmeteo_requests.Client(session = retry_session)
# Make sure all required weather variables are listed here
# The order of variables in hourly or daily is important to assign them correctly below
url = "https://archive-api.open-meteo.com/v1/archive"
params = {
"latitude": 16.7425,
"longitude": -62.1874,
"start_date": "2022-01-08",
"end_date": "2025-06-24",
"hourly": ["temperature_2m", "rain", "wind_speed_10m", "wind_speed_100m", "pressure_msl", "relative_humidity_2m", "dew_point_2m", "surface_pressure", "vapour_pressure_deficit", "boundary_layer_height", "cloud_cover_low", "cloud_cover_mid", "cloud_cover_high", "diffuse_radiation_instant"],
"timezone": "auto"
}
responses = openmeteo.weather_api(url, params=params)
# Process first location. Add a for-loop for multiple locations or weather models
response = responses[0]
print(f"Coordinates {response.Latitude()}°N {response.Longitude()}°E")
print(f"Elevation {response.Elevation()} m asl")
print(f"Timezone {response.Timezone()}{response.TimezoneAbbreviation()}")
print(f"Timezone difference to GMT+0 {response.UtcOffsetSeconds()} s")
# Process hourly data. The order of variables needs to be the same as requested.
hourly = response.Hourly()
hourly_temperature_2m = hourly.Variables(0).ValuesAsNumpy()
hourly_rain = hourly.Variables(1).ValuesAsNumpy()
hourly_wind_speed_10m = hourly.Variables(2).ValuesAsNumpy()
hourly_wind_speed_100m = hourly.Variables(3).ValuesAsNumpy()
hourly_pressure_msl = hourly.Variables(4).ValuesAsNumpy()
hourly_relative_humidity_2m = hourly.Variables(5).ValuesAsNumpy()
hourly_dew_point_2m = hourly.Variables(6).ValuesAsNumpy()
hourly_surface_pressure = hourly.Variables(7).ValuesAsNumpy()
hourly_vapour_pressure_deficit = hourly.Variables(8).ValuesAsNumpy()
hourly_boundary_layer_height = hourly.Variables(9).ValuesAsNumpy()
hourly_cloud_cover_low = hourly.Variables(10).ValuesAsNumpy()
hourly_cloud_cover_mid = hourly.Variables(11).ValuesAsNumpy()
hourly_cloud_cover_high = hourly.Variables(12).ValuesAsNumpy()
hourly_diffuse_radiation_instant = hourly.Variables(13).ValuesAsNumpy()
hourly_data = {"date": pd.date_range(
start = pd.to_datetime(hourly.Time(), unit = "s", utc = True),
end = pd.to_datetime(hourly.TimeEnd(), unit = "s", utc = True),
freq = pd.Timedelta(seconds = hourly.Interval()),
inclusive = "left"
)}
hourly_data["temperature_2m"] = hourly_temperature_2m
hourly_data["rain"] = hourly_rain
hourly_data["wind_speed_10m"] = hourly_wind_speed_10m
hourly_data["wind_speed_100m"] = hourly_wind_speed_100m
hourly_data["pressure_msl"] = hourly_pressure_msl
hourly_data["relative_humidity_2m"] = hourly_relative_humidity_2m
hourly_data["dew_point_2m"] = hourly_dew_point_2m
hourly_data["surface_pressure"] = hourly_surface_pressure
hourly_data["vapour_pressure_deficit"] = hourly_vapour_pressure_deficit
hourly_data["boundary_layer_height"] = hourly_boundary_layer_height
hourly_data["cloud_cover_low"] = hourly_cloud_cover_low
hourly_data["cloud_cover_mid"] = hourly_cloud_cover_mid
hourly_data["cloud_cover_high"] = hourly_cloud_cover_high
hourly_data["diffuse_radiation_instant"] = hourly_diffuse_radiation_instant
MOS_hourly_dataframe = pd.DataFrame(data = hourly_data)
print(MOS_hourly_dataframe)
Coordinates 16.76625633239746°N -62.20843505859375°E
Elevation 309.0 m asl
Timezone b'America/Montserrat'b'GMT-4'
Timezone difference to GMT+0 -14400 s
date temperature_2m rain wind_speed_10m \
0 2022-01-08 04:00:00+00:00 23.249001 0.0 28.146843
1 2022-01-08 05:00:00+00:00 22.598999 0.0 27.255590
2 2022-01-08 06:00:00+00:00 22.348999 0.0 30.498180
3 2022-01-08 07:00:00+00:00 21.848999 0.1 28.241076
4 2022-01-08 08:00:00+00:00 22.098999 0.1 29.215502
... ... ... ... ...
30331 2025-06-24 23:00:00+00:00 25.949001 0.0 43.795891
30332 2025-06-25 00:00:00+00:00 25.398998 0.0 43.793671
30333 2025-06-25 01:00:00+00:00 NaN NaN NaN
30334 2025-06-25 02:00:00+00:00 NaN NaN NaN
30335 2025-06-25 03:00:00+00:00 NaN NaN NaN
wind_speed_100m pressure_msl relative_humidity_2m dew_point_2m \
0 34.634918 1018.500000 71.679909 17.848999
1 33.466450 1018.299988 76.695610 18.299000
2 36.707645 1017.599976 76.176575 17.949001
3 34.743263 1017.500000 79.526360 18.148998
4 35.565376 1017.400024 80.060951 18.499001
... ... ... ... ...
30331 49.774147 1016.799988 75.106316 21.199001
30332 49.785542 1017.099976 81.488701 21.999001
30333 NaN NaN NaN NaN
30334 NaN NaN NaN NaN
30335 NaN NaN NaN NaN
surface_pressure vapour_pressure_deficit boundary_layer_height \
0 982.982544 0.807554 805.0
1 982.713318 0.638901 805.0
2 982.008240 0.643319 750.0
3 981.852722 0.536290 795.0
4 981.785583 0.530283 835.0
... ... ... ...
30331 981.655457 0.833751 1160.0
30332 981.881592 0.600079 1000.0
30333 NaN NaN NaN
30334 NaN NaN NaN
30335 NaN NaN NaN
cloud_cover_low cloud_cover_mid cloud_cover_high \
0 16.0 24.0 0.0
1 0.0 35.0 0.0
2 52.0 43.0 0.0
3 1.0 41.0 0.0
4 28.0 13.0 0.0
... ... ... ...
30331 58.0 0.0 100.0
30332 42.0 0.0 100.0
30333 NaN NaN NaN
30334 NaN NaN NaN
30335 NaN NaN NaN
diffuse_radiation_instant
0 0.0
1 0.0
2 0.0
3 0.0
4 0.0
... ...
30331 0.0
30332 0.0
30333 NaN
30334 NaN
30335 NaN
[30336 rows x 15 columns]
Some cleaning and probing of the data:
MOS_data = MOS_hourly_dataframe.dropna()
MOS_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 25965 entries, 0 to 30332 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 25965 non-null datetime64[ns, UTC] 1 temperature_2m 25965 non-null float32 2 rain 25965 non-null float32 3 wind_speed_10m 25965 non-null float32 4 wind_speed_100m 25965 non-null float32 5 pressure_msl 25965 non-null float32 6 relative_humidity_2m 25965 non-null float32 7 dew_point_2m 25965 non-null float32 8 surface_pressure 25965 non-null float32 9 vapour_pressure_deficit 25965 non-null float32 10 boundary_layer_height 25965 non-null float32 11 cloud_cover_low 25965 non-null float32 12 cloud_cover_mid 25965 non-null float32 13 cloud_cover_high 25965 non-null float32 14 diffuse_radiation_instant 25965 non-null float32 dtypes: datetime64[ns, UTC](1), float32(14) memory usage: 1.8 MB
Recalling that for continuous variables Pearson Corrrelation serves well for association among variables and to measure the level f linearity. Again, there is no rule about variable relations needing to be linear.
# Applying pearson correlation to the data set.
import matplotlib.pyplot as plt
import seaborn as sns
pearson_corr_hourly = MOS_data.corr(method = 'pearson')
# Generating correlation heatmap
plt.figure(figsize = (18, 14))
sns.heatmap(pearson_corr_hourly, annot = True, cmap = 'coolwarm')
plt.title('Pearson Correlation Heatmap for Hourly Data')
plt.savefig('heatmap.pdf', format='pdf')
plt.show()
Based on observations from the prior correlation heatmap, one can conclude that a basic OLS linear prediction model will not be adequate. One can deduce that most scatter plot pairs will not have linear characteristics. However, a quantile regression model will generally resolve the inadequacy of OLS models.
The Base Model Formula:
$$ Q_{y_i}(\tau \mid \mathbf{x}_i) = \mathbf{x}_i^\top \boldsymbol{\beta}(\tau) + \epsilon $$Advantages of This Approach:
Applying a quantile regression model as the base model instead of a complex NWP model reduces computational complexity. The simplest NWP models are the Barotropic model and the Baroclinical model. Such two models don't account for the target of interest. As well, data for the attributes of such two models can be quite elusive and tedious to wrangle into meaningful measurements. Additionally, the boundary conditions, appropriate parameters, relevant time scale and computational complexity are serious concerns; the ability to find a decent fit for montserrat can be extremely challenging. A regression model directly applies weather data generally without temporal considerations.
The random forest MOS model is capable of learning complex, nonlinear relationships in the errors of the base regression model, improving the overall forecast accuracy. This "adopted scheme" is easily scalable to forecast other weather variables without abstract mathematical physics equations.
In the MOS random forest setup the forest adapts to the coefficients of the regression model based on different weather situations (e.g., different temperature ranges or pressure levels, etc.).
Feature Selection¶
The target or response variable of concern is rain (fall), measured in millimeters. Recognising Meteorology or Climatology as serious fields of sustainable professional development the base model must be at least respectable concerning predictors or features. Hence, will apply feature selection as a preliminary step.
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# List of targets
targets = ['rain']
# Loop through each target
for target in targets:
print(f"\n{'='*60}\nAnalyzing Target: {target}\n{'='*60}")
# Define features: drop current target from targets list + use all other columns
possible_features = MOS_data.drop(columns=['rain', 'date'])
# Split data
X_train, X_test, y_train, y_test = train_test_split(
possible_features,
MOS_data[target],
test_size=0.2,
random_state=42
)
# Initialize model
rf_model = RandomForestRegressor(n_estimators=50, random_state=42)
# Fit model
rf_model.fit(X_train, y_train)
# Feature importances
importances = rf_model.feature_importances_
feature_importances = pd.DataFrame({
'Feature': X_train.columns,
'Importance': importances
}).sort_values(by='Importance', ascending=False)
# Plot feature importances
plt.figure(figsize=(12, 6))
plt.barh(feature_importances['Feature'], feature_importances['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.title(f'Feature Importances for Target: {target}')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
# Print ranked features
print("Ranked Features based on Importance:")
print(feature_importances)
# Recursive Feature Elimination
rfe = RFE(estimator=rf_model, n_features_to_select=5)
rfe.fit(X_train, y_train)
selected_features = X_train.columns[rfe.support_]
print("Selected Features by RFE:")
print(selected_features.tolist())
============================================================ Analyzing Target: rain ============================================================
Ranked Features based on Importance:
Feature Importance
2 wind_speed_100m 0.144701
9 cloud_cover_low 0.126407
4 relative_humidity_2m 0.116968
7 vapour_pressure_deficit 0.107037
10 cloud_cover_mid 0.094817
6 surface_pressure 0.092521
1 wind_speed_10m 0.065555
8 boundary_layer_height 0.059934
3 pressure_msl 0.049230
12 diffuse_radiation_instant 0.036907
5 dew_point_2m 0.035584
11 cloud_cover_high 0.035510
0 temperature_2m 0.034828
Selected Features by RFE:
['wind_speed_100m', 'relative_humidity_2m', 'surface_pressure', 'vapour_pressure_deficit', 'cloud_cover_low']
Based on the feature selection operation and the correlation heat map from earlier, the above result is acceptable.
To now build a inexplicit base model and gauge its performance with unobserved data.
import pandas as pd
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_absolute_error, mean_pinball_loss, accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
import numpy as np
# Define quantiles
quantiles = [0.25, 0.5, 0.75, 0.9]
models = {}
predictions = {}
# Identifying the features and the target
X = MOS_data[['wind_speed_100m', 'relative_humidity_2m',
'surface_pressure', 'vapour_pressure_deficit',
'cloud_cover_low']]
y = MOS_data['rain'] # Target as 1D array
# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train QuantileRegressor for each quantile
for q in quantiles:
model = QuantileRegressor(quantile=q, alpha=0, solver='highs')
model.fit(X_train, y_train)
models[q] = model
predictions[q] = model.predict(X_test)
# Median prediction
y_pred_median = predictions[0.5]
# 1. Regression Evaluation
mae = mean_absolute_error(y_test, y_pred_median)
pinball = mean_pinball_loss(y_test, y_pred_median, alpha=0.5)
print(f"--- Regression Evaluation (Median model) ---")
print(f"MAE: {mae:.3f}")
print(f"Pinball Loss (q=0.5): {pinball:.3f}")
# 2a. Classification Evaluation — Option 1: Lower percentile threshold (40th)
threshold_40 = np.percentile(y_train, 40)
y_class_true_40 = (y_test > threshold_40).astype(int)
y_class_pred_40 = (y_pred_median > threshold_40).astype(int)
acc_40 = accuracy_score(y_class_true_40, y_class_pred_40)
cm_40 = confusion_matrix(y_class_true_40, y_class_pred_40)
print(f"\n--- Classification Evaluation (Threshold = 40th percentile of y_train) ---")
print(f"Threshold value: {threshold_40:.3f}")
print(f"Accuracy: {acc_40:.3f}")
print("Confusion Matrix:")
print(cm_40)
# 2b. Classification Evaluation — Option 2: Model's predicted median threshold
threshold_pred = np.median(y_pred_median)
y_class_true_pred = (y_test > threshold_pred).astype(int)
y_class_pred_pred = (y_pred_median > threshold_pred).astype(int)
acc_pred = accuracy_score(y_class_true_pred, y_class_pred_pred)
cm_pred = confusion_matrix(y_class_true_pred, y_class_pred_pred)
print(f"\n--- Classification Evaluation (Threshold = median of predicted values) ---")
print(f"Threshold value: {threshold_pred:.3f}")
print(f"Accuracy: {acc_pred:.3f}")
print("Confusion Matrix:")
print(cm_pred)
--- Regression Evaluation (Median model) --- MAE: 0.089 Pinball Loss (q=0.5): 0.044 --- Classification Evaluation (Threshold = 40th percentile of y_train) --- Threshold value: 0.000 Accuracy: 0.749 Confusion Matrix: [[3891 0] [1302 0]] --- Classification Evaluation (Threshold = median of predicted values) --- Threshold value: 0.000 Accuracy: 0.749 Confusion Matrix: [[3891 0] [1302 0]]
Manually provide counts...
TN, FP, FN, TP = 3891, 1302, 0, 0
METRICS: ACCURACY = (TP + TN) / (TP + TN + FP + FN)
PRECISION = TP / (TP + FP) # How many predicted 1s are correct
RECALL = TP / (TP + FN) # How many actual 1s were captured
F1 = 2 * precision * recall / (precision + recall)
Accuracy: 0.749 74.9% of test samples are classified as "not above median", but...
Precision undefined (division by zero) Because no positives were predicted
Recall 0.000 Model predicted no positives — it missed all actual positive rainfall cases
F1 Score 0.000 No balance between precision and recall — this is a dead classifier
Of consequence, will revert to a model without feature selection.
# Identifying the features and the target
X = MOS_data.drop(columns = ['rain', 'date'])
y = MOS_data['rain'] # Target as 1D array
# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define quantiles
quantiles = [0.25, 0.5, 0.75, 0.95]
models = {}
predictions = {}
# 1. Train QuantileRegressor for each quantile
for q in quantiles:
model = QuantileRegressor(quantile=q, alpha=0, solver='highs')
model.fit(X_train, y_train)
models[q] = model
predictions[q] = model.predict(X_test)
# 2. Regression Evaluation (Example: for the median model)
y_pred_median = predictions[0.5]
mae = mean_absolute_error(y_test, y_pred_median)
pinball = mean_pinball_loss(y_test, y_pred_median, alpha=0.5)
print(f"--- Regression Evaluation (Median model) ---")
print(f"MAE: {mae:.3f}")
print(f"Pinball Loss (q=0.5): {pinball:.3f}")
# 3. Classification-style Evaluation
# Example: classify if true target is above or below the predicted median
# This mimics a binary classifier
y_class_true = (y_test > np.median(y_train)).astype(int) # True: above historical median
y_class_pred = (y_pred_median > np.median(y_train)).astype(int)
acc = accuracy_score(y_class_true, y_class_pred)
cm = confusion_matrix(y_class_true, y_class_pred)
print(f"\n--- Classification Evaluation (based on 50th percentile threshold) ---")
print(f"Accuracy: {acc:.3f}")
print("Confusion Matrix:")
print(cm)
--- Regression Evaluation (Median model) --- MAE: 0.088 Pinball Loss (q=0.5): 0.044 --- Classification Evaluation (based on 50th percentile threshold) --- Accuracy: 0.507 Confusion Matrix: [[1347 2544] [ 17 1285]]
Predicted
| 0 | 1 |
------------------------
True 0 | 1347 | 2544 |
True 1 | 17 | 1285 |
ANALYSIS --
Accuracy (50.7%) Barely better than flipping a coin — suggests the model is misclassifying a large number of observations.
Precision (33.6%) Only 1 in 3 predicted positives is actually correct. High false positive rate.
Recall (98.7%) Nearly all actual positives are correctly identified — very few false negatives.
F1 Score (50.1%) A moderate harmonic balance between precision and recall. Weighted toward recall due to high imbalance.
Strength -- High Recall (98.7%): The model is excellent at capturing actual positive cases (e.g., identifying risky, extreme, or high-priority instances).
Weakness -- Very Low Precision (33.6%): Most of the predicted positives are actually false. That’s a high false alarm rate.
This model behaves like a “better-safe-than-sorry” classifier:
1. It labels almost everything potentially risky (or “positive”).
2. Almost never misses a real positive, but triggers a lot of unnecessary alarms.
This is good if:
1. False negatives are dangerous/costly
E.g., detecting:
Floods
It’s problematic if:
False positives are expensive or disruptive
E.g., costly interventions, user alerts, wasted inspections, etc.
Of consequence, to proceed with a quantile regression model, but now to develop an explicit model...
import pandas as pd
from sklearn.model_selection import train_test_split
# Identifying the features and the target
X = MOS_data.drop(columns = ['rain', 'date'])
y = MOS_data['rain'] # Target as 1D array
# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Identify clearly the the model coefficients
import statsmodels.api as sm
# Standardization scales features so that they have a mean of 0 and a standard deviation of 1.
# Multiciollinearity is not a serious issue based on observation of the Pearson correlation matrix;
# All highly correlated pairs are resolved by dropping feat with lower feature importance.
# There are no serious near-linear dependencies in the predictors from observation of the Pearson correlation matrix.
# The meteorological features have considerable different scales however, so to standardize.
# Add constant (intercept)
X_with_const = sm.add_constant(X_train)
# Fit quantile regression model at quantile 0.5 (median)
quantile = 0.5
sm_model = sm.QuantReg(y_train, X_with_const)
result = sm_model.fit(q=quantile)
# Print summary with variable names
print(result.summary())
QuantReg Regression Results
==============================================================================
Dep. Variable: rain Pseudo R-squared: 0.01618
Model: QuantReg Bandwidth: 0.002483
Method: Least Squares Sparsity: 0.02279
Date: Fri, 27 Jun 2025 No. Observations: 20772
Time: 23:57:29 Df Residuals: 20758
Df Model: 13
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const 1.3545 0.238 5.702 0.000 0.889 1.820
temperature_2m -0.0542 0.009 -5.721 0.000 -0.073 -0.036
wind_speed_10m 0.0002 6.17e-05 3.586 0.000 0.000 0.000
wind_speed_100m -0.0002 5.29e-05 -3.883 0.000 -0.000 -0.000
pressure_msl -0.4430 0.076 -5.821 0.000 -0.592 -0.294
relative_humidity_2m 0.0011 0.000 6.527 0.000 0.001 0.001
dew_point_2m -0.0004 0.001 -0.534 0.594 -0.002 0.001
surface_pressure 0.4588 0.079 5.820 0.000 0.304 0.613
vapour_pressure_deficit 0.0327 0.005 6.990 0.000 0.023 0.042
boundary_layer_height -1.475e-06 6.73e-07 -2.194 0.028 -2.79e-06 -1.57e-07
cloud_cover_low 0.0002 4.8e-06 38.690 0.000 0.000 0.000
cloud_cover_mid 0.0019 4.84e-06 393.481 0.000 0.002 0.002
cloud_cover_high 2.031e-06 2.11e-06 0.964 0.335 -2.1e-06 6.16e-06
diffuse_radiation_instant 1.956e-05 1.14e-06 17.149 0.000 1.73e-05 2.18e-05
=============================================================================================
The condition number is large, 5.34e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
C:\Users\verlene\anaconda3\Lib\site-packages\statsmodels\regression\quantile_regression.py:191: IterationLimitWarning: Maximum number of iterations (1000) reached.
warnings.warn("Maximum number of iterations (" + str(max_iter) +
Dropping the features with poor p-values.
import pandas as pd
from sklearn.model_selection import train_test_split
# Identifying the features and the target
X = MOS_data.drop(columns = ['rain', 'date',
'dew_point_2m', 'cloud_cover_high',
'pressure_msl'])
y = MOS_data['rain'] # Target as 1D array
# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Identify clearly the the model coefficients
import statsmodels.api as sm
# Add constant (intercept)
X_with_const = sm.add_constant(X_train)
# Fit quantile regression model at quantile 0.5 (median)
quantile = 0.5
sm_model = sm.QuantReg(y_train, X_with_const)
result = sm_model.fit(q=quantile)
# Print summary with variable names
print(result.summary())
QuantReg Regression Results
==============================================================================
Dep. Variable: rain Pseudo R-squared: 0.01610
Model: QuantReg Bandwidth: 0.002377
Method: Least Squares Sparsity: 0.02295
Date: Fri, 27 Jun 2025 No. Observations: 20772
Time: 23:57:29 Df Residuals: 20761
Df Model: 10
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -0.0103 0.044 -0.236 0.814 -0.096 0.075
temperature_2m -0.0012 0.000 -6.498 0.000 -0.002 -0.001
wind_speed_10m 0.0003 6.1e-05 4.114 0.000 0.000 0.000
wind_speed_100m -0.0002 5.25e-05 -4.300 0.000 -0.000 -0.000
relative_humidity_2m 0.0008 0.000 6.605 0.000 0.001 0.001
surface_pressure -4.139e-05 4.39e-05 -0.944 0.345 -0.000 4.46e-05
vapour_pressure_deficit 0.0253 0.004 7.091 0.000 0.018 0.032
boundary_layer_height -1.973e-06 6.66e-07 -2.963 0.003 -3.28e-06 -6.68e-07
cloud_cover_low 0.0002 4.77e-06 39.313 0.000 0.000 0.000
cloud_cover_mid 0.0019 4.86e-06 390.480 0.000 0.002 0.002
diffuse_radiation_instant 1.956e-05 1.14e-06 17.234 0.000 1.73e-05 2.18e-05
=============================================================================================
The condition number is large, 6.93e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Will now build a model based on feature selection from earlier...
import pandas as pd
from sklearn.model_selection import train_test_split
# Identifying the features and the target
X = MOS_data[['wind_speed_100m', 'relative_humidity_2m', 'surface_pressure',
'vapour_pressure_deficit', 'cloud_cover_low']]
y = MOS_data['rain'] # Target as 1D array
# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Identify clearly the the model coefficients
import statsmodels.api as sm
# Add constant (intercept)
X_with_const = sm.add_constant(X_train)
# Fit quantile regression model at quantile 0.5 (median)
quantile = 0.5
sm_model = sm.QuantReg(y_train, X_with_const)
result = sm_model.fit(q=quantile)
# Print summary with variable names
print(result.summary())
QuantReg Regression Results
==============================================================================
Dep. Variable: rain Pseudo R-squared: -1.102e-06
Model: QuantReg Bandwidth: 0.01324
Method: Least Squares Sparsity: 0.02388
Date: Fri, 27 Jun 2025 No. Observations: 20772
Time: 23:57:30 Df Residuals: 20766
Df Model: 5
===========================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------------
const -5.36e-07 0.044 -1.22e-05 1.000 -0.086 0.086
wind_speed_100m -4.012e-09 9.69e-06 -0.000 1.000 -1.9e-05 1.9e-05
relative_humidity_2m 5.062e-08 3.88e-05 0.001 0.999 -7.6e-05 7.61e-05
surface_pressure -4.344e-09 4.4e-05 -9.87e-05 1.000 -8.62e-05 8.62e-05
vapour_pressure_deficit 1.081e-06 0.001 0.001 0.999 -0.002 0.002
cloud_cover_low 6.18e-08 4.71e-06 0.013 0.990 -9.17e-06 9.29e-06
===========================================================================================
The condition number is large, 5.23e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
NOTE: A low pseudo R² doesn't always mean the model is bad — especially for complex, noisy phenomena like precipitation concerning hourly increments. Rainfall is highly stochastic — hard to model with high R².
rain_data = MOS_data[['date', 'rain']]
rain_data.info()
<class 'pandas.core.frame.DataFrame'> Index: 25965 entries, 0 to 30332 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 25965 non-null datetime64[ns, UTC] 1 rain 25965 non-null float32 dtypes: datetime64[ns, UTC](1), float32(1) memory usage: 507.1 KB
Some basic time series analysis or decomposition.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.api as sm
# Set the 'date' column as the DataFrame index
rain_data = rain_data.set_index('date')
# Ensure the data is sorted by date
rain_data = rain_data.sort_index()
# For time series analysis, it's often good practice to resample to a regular frequency.
# Here, we resample to a daily frequency and fill any missing values with 0.
# You might choose a different resampling frequency (e.g., 'W' for weekly, 'M' for monthly)
# depending on the nature of your rain data and the seasonality you expect.
rain_data_hourly = rain_data['rain'].resample('h').sum().fillna(0)
# --- Time Series Decomposition ---
# We use seasonal_decompose to break down the time series into trend, seasonal, and residual components.
# 'model' can be 'additive' or 'multiplicative'. 'additive' is suitable when the
# seasonal fluctuations are roughly constant over time. 'multiplicative' is for
# when they change proportionally to the level of the series.
# 'period' is the number of observations in a cycle. For hourly data with weekly seasonality, period=24.
# Adjust 'period' based on your data's primary seasonality (e.g., 12 for monthly data with yearly seasonality).
try:
decomposition = seasonal_decompose(rain_data_hourly, model='additive', period=24)
# Plotting the decomposition
fig, axes = plt.subplots(4, 1, figsize=(12, 10), sharex=True)
axes[0].plot(decomposition.observed)
axes[0].set_ylabel('Observed')
axes[0].set_title('Time Series Decomposition of Rain Data')
axes[1].plot(decomposition.trend)
axes[1].set_ylabel('Trend')
axes[2].plot(decomposition.seasonal)
axes[2].set_ylabel('Seasonal')
axes[3].plot(decomposition.resid)
axes[3].set_ylabel('Residual')
axes[3].set_xlabel('Date')
plt.tight_layout(rect=[0, 0.03, 1, 0.96]) # Adjust layout to prevent title overlap
plt.show()
except Exception as e:
print(f"Error during time series decomposition: {e}")
print("Please check if your time series data has enough observations for the chosen 'period'.")
print("For example, if period=7, you need at least 14 data points for decomposition to work well.")
# --- LOWESS (Locally Weighted Scatterplot Smoothing) ---
# LOWESS is a non-parametric regression method that fits a series of local linear regressions
# to smooth a scatter plot. It's great for visualizing the trend in noisy data.
# 'frac' parameter: controls the smoothness. It's the fraction of data used when estimating
# each local regression. Smaller frac = less smooth, larger frac = more smooth.
# Typical values are between 0.1 and 0.8. Adjust based on how much smoothing you need.
lowess_smoothed = sm.nonparametric.lowess(rain_data_hourly.values, rain_data_hourly.index.astype(np.int64), frac=0.1)
# Convert the output back to a DataFrame with datetime index for easier plotting
lowess_df = pd.DataFrame(lowess_smoothed, columns=['date_int', 'smoothed_rain'])
lowess_df['date'] = pd.to_datetime(lowess_df['date_int'])
lowess_df = lowess_df.set_index('date')
lowess_df = lowess_df.sort_index()
# Plotting LOWESS smoothing
plt.figure(figsize=(12, 6))
plt.plot(rain_data_hourly.index, rain_data_hourly, label='Original Rain Data', alpha=0.7)
plt.plot(lowess_df.index, lowess_df['smoothed_rain'], color='red', linewidth=2, label='LOWESS Smoothed')
plt.title('LOWESS Smoothing of Rain Data')
plt.xlabel('Date')
plt.ylabel('Rain')
plt.legend()
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
Based on seasonal component the data seems to genuinely lacks daily seasonality. The hourly values of 'rain' don’t show a consistent pattern each day.
Tests For Seasonality¶
Will pursue seasonality check by spectral analysis with periodogram, and provide mention on the Ljung-Box test.
Spectral Analysis or Periodogram¶
Spectral analysis, particularly using the periodogram, is based on the Fourier transform of a time series. It quantifies how the variance (power) of a signal is distributed across different frequencies. The mathematical structure and intuition behind it:
1. Signal Orientation
Let $x(t)$ be a real-valued time series sampled at regular intervals, where:
$ t = 1, 2,..., N - 1 $ (discrete time steps);
Sample interval is $ \Delta\,t$ (1 hour);
Total duration being, $ T = N \times \Delta\,t $
2. Discrete Fourier Transform
Let $x(t)$ be a discrete time series with $N$ samples. The Discrete Fourier Transform (DFT) is defined as:
$$X(f_k) = \sum_{t=0}^{N-1} x(t) \cdot e^{-2\pi i f_k t}$$where the discrete frequency $( f_k)$ is given by:
$$f_k = \frac{k}{N \Delta t}, \quad \text{for } k = 0, 1, 2, \dots, N-1 $$The corresponding periodogram (power spectral density estimate) is:
$$P(f_k) = \frac{1}{N} \left| X(f_k) \right|^2$$3. Periodogram Definition
The periodogram estimates the power spectral density (PSD) of $x(t)$ as:
$$P(f_k) = \frac{1}{N}\,\,\left|X(f_k)\right|^2$$This represents how the power (variance) of the time series is distributed across frequencies $f_k$
Units of $P(f_k)$ depend on the units of $x(t)$; often variance per unit frerequency.
For real-valued time series, the periodogram is symmetric around the Nyquist frequency $f = \frac{1}{2\Delta\,t}$.
4. Interpretation of Peaks
A peak in $P(f_k)$ indicates a dominant cycle of period $T_k = \frac{1}{f_k}$.
Example: a strong peak at $f_k = \frac{1}{24}$ (cycles/hour) corresponds to a 24-hour cycle.
5. Continuous-Time Analogy
In continuous time, the Power Spectral Density (PSD), $S(f)$ is defined via the Weiner-Khinchin theorem:
$$S(f) = \int_{-\infty}^{\infty} R(\tau) \, e^{-2\pi i f \tau} \, d\tau$$where $R(\tau) = \mathbb{E}[x(t) x(t+\tau)]$ is the autocorrelation function of the process $x(t)$, and $f$ is the frequency (in Hz).
So, spectral analysis or periodogram concerns testing for hidden cycles or periodic structure.
A flat power spectrum (white noise) supports randomness.
Dominant Peaks suggest non-random periodic behavior.
import numpy as np
from scipy.signal import periodogram
import matplotlib.pyplot as plt
# Acquiring a time and attribute domain or dataframe
rain_series_test = MOS_data[['date', 'rain']]
# Extract rain values as 1D array
rain_series = rain_series_test['rain'].values
# Set sampling frequency: 1/hour (since data is hourly)
fs = 1 # samples per hour
# Compute periodogram
f, Pxx = periodogram(rain_series, fs=fs)
# Exclude zero frequency (DC component)
f = f[1:]
Pxx = Pxx[1:]
# Convert frequency to period (hours and days)
period_hours = 1 / f
period_days = period_hours / 24
# Plot periodogram with period axis (in days)
plt.figure(figsize=(14, 6))
plt.semilogy(period_days, Pxx)
plt.title("Periodogram of Hourly Rainfall in Montserrat")
plt.xlabel("Period (Days)")
plt.ylabel("Power")
plt.grid(True)
plt.xscale('log')
plt.axvline(1, color='red', linestyle='--', label='Daily cycle (1d)')
plt.axvline(7, color='green', linestyle='--', label='Weekly cycle (7d)')
plt.axvline(30, color='orange', linestyle='--', label='Monthly cycle (30d)')
plt.axvline(180, color='purple', linestyle='--', label='Seasonal cycle (180d)')
plt.legend()
plt.show()
NOTE: for peaks that greatly stand out they are designated as strong peaks, to identify observed cycles w.r.t. to the associated time. From observation, peaks appear to sluggishly become more identifiable when approaching the seasonal duration. However, there are other peaks comparable to the one that's well aligned with the seasonal cycle point. The peak at seasonal cycle period is dominant compared to the two preceeding peaks (but it's there).
Ljung-Box Test¶
The Ljung-Box test is a statistical test designed to detect whether a time series exhibits significant autocorrelation at lags up to a specified maximum lag $h$. It is widely used to evaluate whether residuals from a time series model resemble white noise. The test improves upon the Box-Pierce statistic by applying a small-sample correction.
- NULL AND ALTERNATIVE HYPOTHESES --
$H_0$: the data are independent (no autocorrelation up to a lag $h$)
$H_a$: The data are not independent (at least one autocorrelation is non-zero up to lag $h$)
- TEST STATISTIC --
Have:
$n$: number of observations
$h$: number of lags tested
$r_k$: sample autocorrelation at lag $k$
Then the Ljung-Box test statistic is:
$$ Q = n(n+2) \sum_{k=1}^{h} \frac{r_k^2}{n - k} $$- DISTRIBUTION --
Under $H_0$, the test statistic approximately follows a chi-squared distribution with $h$ degrees of freedom:
$$ Q \sim \chi^2(h) $$The corresponding p-value is computed as:abs
$$p = \mathbb{P}(\chi^2_h > Q) $$- INTERPRETATION --
If $p < \alpha$ (e.g., 0.05), reject $H_0 $: Evidence of autocorrelation.
If $p \geq \alpha$, fail to reject $H_0$: No significant autocorrelation detected.
- APPLICATIONS --
When applied to residuals from a fitted model, it checks for model adequacy.
When applied to raw time series data, a significant result (autocorrelation) at seasonal lags may suggest seasonality, but is not conclusive on its own.
Ljung-Box Test Suggesting Seasonality¶
The case of interest:
Downsample to daily data,
Run Ljung-Box test at lag = 365 (1 year),
Getting a significant p-value,
Observing a peak in ACF at lag 365,
Then it supports seasonality at yearly frequency. Namely, it suggests seasonality, but does not provide confirmation.
NOTE: the beneath code is placed in commentary form because autocorrelation related computations can be quite computationally expensive; hourly observations ranging from 2022 to 2025 are applied.
# from statsmodels.stats.diagnostic import acorr_ljungbox
# Resample to daily rainfall totals
# rain_daily = rain_series.resample('D').sum()
# Such above reduces noise and lets you test for annual cycles more feasibly.
# results = acorr_ljungbox(rain_daily, lags=[365], return_df=True)
# The above line to test at annual lag (daily data)
# print(results)
# If p-value < 0.05, this suggests autocorrelation at the yearly lag.
NOTE: earlier, the seasonal component of the times series was observed, which suggested possible sporadic behaviour. As well, spectral analysis or periodiogram was developed earlier, which conveyed a peak at the seasonal cycle, but isn't domininant (but it's there). Then, running the Ljung-Box Test, and if a p-value < 0.05 is acquired, such may further strengthen the position of declaring sporadic or random rain fall behaviour.
Nevertheless, to continue with MOS development based on feature selection.
Mathematical Structure for the MOS Random Forest Model¶
The adopted base model is a quantile linear regression to predict the target variable $y$ based in input features $X$.
1. Multivariate Quantile Regression Model as the Base Model:
Input Features: $X = [X_1\,X_2\,...,\,X_n]$, where $X_i$ is the $i$-th feature.
Coefficients: $\beta = [\beta_1\,\beta_2\,...,\,\beta_n]$, representing the relationship between each feature and the target.
Prediction Function:
$$Q_y(\tau|X) = \beta_0 + \sum_{i=1}^n \beta_i X_i$$where:
$Q_y(\tau|X)$ is the conditional $\tau$-quantile of $y$ w.r.t. $X$ and $\beta$ as the base model.
$\beta_0$ is the intercept.
The residuals (errors) are computed as:
$$r_{\text{train}} = y_{\text{train}} - \hat{Q_y}(\text{train})$$$$r_{\text{test}} = y_{\text{test}} - \hat{Q_y}(\text{test})$$$y_{\text{train}}$ and $y_{\text{test}}$ are the actual observed values for the target training and test data, respectively.
$\hat{Q_y}(\text{train})$ and $\hat{Q_y}(\text{test})$ are the predictions from the base Quantile regression model.
2. Residual Modeling with Random Forest (MOS Model):
To rectify for errors made by the base model, a random forest is trained on the residuals of the base model predictions. The notion is that random forest can capture non-linearities and complex interactions between the features that the regression model will not account for.
Random Forest Model for Residuals:
The input to the random forest model is stil the same feature set $X$, however, the target is now the residuals from the base model:
$$\hat{r} = f_{\text{RF}}(X)$$where:
$f_{\text{RF}}$ is the random forest function trained to predict residuals $r_{\text{train}}$ on the training data.
Such above model learns a non-linear mapping between the features and the residuals.
3. Final Corrected Prediction:
The final corrected prediction is acquired by addition the residual corrections from the random forest model to the base predictions.
Final Prediction:
$$\hat{Qy}_{\text{final}} = \hat{Qy}_{\text{base}} + \hat{r}$$where:
$\hat{y}_final$ is the final prediction (with corrections);
$\hat{y}_{\text{base}}$ is the prediction from the base model;
$\hat{r}$ is the correction (residual prediction) from the random forest model.
For the test set, such becomes:
$$\hat{Qy}_{\text{final},\,\text{test}} = \hat{Qy}_{\text{base},\,\text{test}} + \hat{r}_{\text{test}}$$4. Mean Squared Error (MSE) for Model Evaluation:
Model performance is evaluated using MSE, measuring the average squared distance between the actual values and the predicted values.
Base Model MSE:
$$\text{MSE}_{\text{base}} = \frac{1}{m} \sum_{i=1}^m \left(y_i - \hat{y}_{\text{base}} \right)^2$$MOS Model (Corrected) MSE:
$$\text{MSE}_{\text{MOS}} = \frac{1}{m} \sum_{i=1}^m \left(y_i - \hat{y}_{\text{final}} \right)^2$$5. Feature Importance in Random Forest:
The feature importance from the random forest is a measure of how much each feature contributes to reducing the variance of the residuals:
Feature Importance Score: $I(X_i)$ representing how much the feature $X_i$ reduces the model's error. Such can be observed in a bar plot to comprehend which features are the most important for predicting the residuals (namely, the errors the base model missed).
6. Forecasting Using MOS:
Base Forecast:
$$\hat{Qy}_{\text{base}} = \beta_0 + \sum_{i=1}^n \beta_i X_i$$MOS Corrected Forecast:
$$y_{\text{MOS corrected}} = \hat{Qy}_{\text{base}} + f_{\text{RF}}(X)$$Such formulation conveys how the multivariate linear regression model forms the basis of prediction, while the random forest provides a second layer of refinement by capturing non-linear behaviors. The overall approach amplifies prediction accuracy by combining the strengths of both models.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import QuantileRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Identifying the features and the target
X = MOS_data[['wind_speed_100m', 'relative_humidity_2m', 'surface_pressure',
'vapour_pressure_deficit', 'cloud_cover_low']]
y = MOS_data['rain'] # Target as 1D array
# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize and train the Base (Quantile Regression) Model
base_model = QuantileRegressor(quantile = 0.5) # for median
base_model.fit(X_train, y_train)
# Generate base predictions on the training and test data
base_train_preds = base_model.predict(X_train)
base_test_preds = base_model.predict(X_test)
# Establish residuals (errors) between actual values and base model predictions
train_residuals = y_train - base_train_preds
test_residuals = y_test - base_test_preds
# Train the random forest MOS model to predict the residuals
rf_mos = RandomForestRegressor(n_estimators= 100, random_state=42)
rf_mos.fit(X_train, train_residuals)
# NOTE: the test sets should be viewed as new data.
# Predict the residuals on the test data using the random forest MOS model
mos_residual_corrections = rf_mos.predict(X_test)
# Final corrected predictions = base model predictions + MOS corrections
final_predictions = base_test_preds + mos_residual_corrections
# Evaluate the model using mean squared error (MSE)
base_mse = mean_squared_error(y_test, base_test_preds)
mos_mse = mean_squared_error(y_test, final_predictions)
print(f"Base Regression Model MSE: {base_mse}")
print(f"MOS Random Forest Model Corrected MSE: {mos_mse}")
# Step 11: Visualize Feature Importance for the Random Forest Model
def visualize_feature_importance(model, X):
if hasattr(model, 'feature_importances_'):
feature_importance = model.feature_importances_
features = X.columns
plt.figure(figsize=(10, 6))
plt.barh(features, feature_importance, color='skyblue')
plt.xlabel("Importance")
plt.ylabel("Features")
plt.title("Feature Importances from MOS Random Forest Model")
plt.show()
else:
print("The model does not provide feature importances.")
# Visualize feature importance for the Random Forest model
visualize_feature_importance(rf_mos, pd.DataFrame(X_train, columns = X.columns))
# Forecast Using the MOS Model
# Using test data (X_test) as new data for forecasting
# Base forecast using the regression model
base_forecast = base_model.predict(X_test)
# MOS corrections using the random forest model
mos_corrections = rf_mos.predict(X_test)
# Final forecast (Base forecast + MOS corrections)
final_forecast = base_forecast + mos_corrections
print("Comparing Observed and Final Forecast:")
# Combine into a DataFrame
df = pd.DataFrame({'Observed Data': y_test, 'Final Forecast': final_forecast})
df
Base Regression Model MSE: 0.15434815878494398 MOS Random Forest Model Corrected MSE: 0.1231500823204265
Comparing Observed and Final Forecast:
| Observed Data | Final Forecast | |
|---|---|---|
| 26331 | 0.1 | 0.479 |
| 2533 | 0.9 | 0.191 |
| 2929 | 0.0 | 0.108 |
| 13114 | 1.1 | 0.035 |
| 6835 | 0.0 | 0.006 |
| ... | ... | ... |
| 433 | 0.1 | 0.063 |
| 30261 | 0.0 | 0.002 |
| 24617 | 0.0 | 0.076 |
| 5836 | 0.0 | 0.072 |
| 22791 | 0.0 | 0.000 |
5193 rows × 2 columns
Ensemble Forecast Models¶
Weather forecasting, once a realm of educated guesswork, has evolved into a complex science aided by powerful computational tools. Among these, ensemble forecast models have emerged as indispensable instruments for predicting weather patterns with greater accuracy and uncertainty quantification.
Ensemble forecasting is a statistical method that involves running multiple simulations of a weather model with slightly different initial conditions and/or model parameters. This approach recognizes the inherent uncertainty in weather prediction, arising from the chaotic nature of atmospheric dynamics and the limitations of observation networks. By generating a range of possible outcomes, ensemble models provide a more comprehensive picture of the potential weather scenarios, allowing forecasters to assess the likelihood of various events and communicate uncertainty effectively.
The key components of an ensemble forecast model include:
- Initial Conditions: These are the starting points for each simulation, derived from observations of atmospheric variables like temperature, pressure, humidity, and wind speed at different locations.
- Model Physics: The underlying equations that describe the physical processes governing atmospheric behavior, such as advection, convection, radiation, and precipitation.
- Perturbations: Small variations introduced to the initial conditions and/or model parameters to create different ensemble members.
- Ensemble Size: The number of individual simulations within the ensemble. Larger ensembles generally provide better statistical representation of uncertainty.
Ensemble models offer several advantages over traditional single-run forecasts:
- Uncertainty Quantification: By generating a range of possible outcomes, ensemble models provide a measure of the forecast's reliability. This helps forecasters communicate uncertainty effectively to the public and decision-makers.
- Improved Skill: Ensemble forecasts often exhibit better skill than single-run forecasts, especially for rare or extreme events. This is because they can capture the variability associated with such events more accurately.
- Early Warning: Ensemble models can provide early warnings of potential severe weather events, allowing for timely preparation and mitigation measures.
- Climate Applications: Ensemble models are used to study climate variability and change, providing insights into long-term trends and potential impacts.
However, ensemble forecasting is not without its challenges. One limitation is the computational cost associated with running multiple simulations. As models become more complex and the number of ensemble members increases, the computational requirements can be substantial. Additionally, the effectiveness of ensemble forecasts depends on the quality of the initial conditions and the accuracy of the model physics. Errors in either of these can lead to degraded forecast performance.
Despite these challenges, ensemble forecast models have become an essential tool for modern weather prediction. By providing a more comprehensive and probabilistic view of the weather, they help forecasters make informed decisions and communicate uncertainty effectively to the public. As computational capabilities continue to advance, we can expect further improvements in ensemble forecasting, leading to even more accurate and reliable weather predictions.
Various literature (Warner 2010; Muschinski et al 2023) have provided foundations and modelling for the implementation of weather forecast ensemble models. Now, to demonstrate "kindergarten" level development of ensemble models.
An Ensemble Model Based Purely on Random Forests¶
Due to limitations involving the scope of model physics concerning appropriate parameters, boundary conditions, time orientation, and also having reluctance to delve into computational complexity, a pure machine learning environment will now be adopted. Additionally, basic regression models (multi linear or quantile regression) don't seem to be highly suited to the data applied. To now observe the performance of a random forest model by itself.
Data Preparation¶
The goal is to prepare a dataset for the ensemble weather forecast model representating, representing historical weather data as an array. In a professional and constructive environment data structure technologies have been well established. In the Python language environment such concerns incorporation of the NumPy and Pandas libraries for computation and manipulation of data. The idea of an array or matrix considered:
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nm} \end{pmatrix}$$where $x_{ij}$ represents the $j$-th weather variable (such as temperature, humidity, etc.) at the $i$-th time step. So, columns represent variables (features or predictors), and rows represent time steps or observations through time; datetime format is a standard convention. To strictly identify the configuration as an array rather than a matrix for linear operations because linear relationships between the features are not strongly observed, recalling the Pearson correlation heat map.
Model Development¶
Introduce predictive models $f$ to learn about the residing relationships between the target of interest and the applied features. If $X$ represents input features and $y$ represents the target, then:
$$\hat{y} = f(X, \theta)$$where $\hat{y}$ is the predicted output, and $\theta$ are the model parameters optimized during training. Supurvised models such as regression and ensemble models such as random forests are trained.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import QuantileRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
# Identifying the features and the target
X = MOS_data[['wind_speed_100m', 'relative_humidity_2m', 'surface_pressure',
'vapour_pressure_deficit', 'cloud_cover_low']]
y = MOS_data['rain'] # Target as 1D array
# Train-test split procedure
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize models
models = {
'quantile_regression': QuantileRegressor(),
'random_forest': RandomForestRegressor(n_estimators=100, random_state=42)
}
# Train models
for name, model in models.items():
model.fit(X_train, y_train)
Concering the prior development, for random forests, it's an ensemble of decision trees, where, each tree $T_i$ learns a sub-model:
$$\hat{y} = \frac{1}{B}\sum_{i=1}^{B} T_i (X)$$NOTE: the above is not a linear analytical function.
Generating Ensemble Forecasts¶
Ensemble members are created by pertubing the initial conditions or the features slightly. For a based model $f$:
$$\hat{y}=f(X+\epsilon_i)$$where $\epsilon_i$ is a small perturbation for ensemble member $i$.
# Create ensemble members
n_members = 10
ensemble_predictions = []
for i in range(n_members):
# Perturb the test set
X_test_perturbed = X_test + np.random.normal(0, 0.05, X_test.shape)
predictions = models['random_forest'].predict(X_test_perturbed)
ensemble_predictions.append(predictions)
# Convert to numpy array for easier calculations
ensemble_predictions = np.array(ensemble_predictions)
The above script simulates the generation of $n$ ensembles, each identifying a slightly unique atmospheric state by adding noise $\epsilon_i$ to the input features.
Generating Ensemble Forecasts¶
The ensemble mean and spread summarize the ensemble’s central tendency and uncertainty:
$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i$$$$\sigma = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2}$$ensemble_mean = np.mean(ensemble_predictions, axis=0)
ensemble_spread = np.std(ensemble_predictions, axis=0)
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 5))
plt.plot(y_test.values, label='Observed', color='black')
plt.plot(ensemble_mean, label='Ensemble Mean', color='blue')
plt.fill_between(range(len(y_test)),
ensemble_mean - ensemble_spread,
ensemble_mean + ensemble_spread,
color='blue', alpha=0.3, label='Ensemble Spread')
plt.legend()
plt.title('Ensemble Forecast with Spread')
plt.show()
Probabilistic Forecast and Evaluation¶
The CRPS (Continuous Ranked Probability Score) evaluates the ensemble forecast by comparing the distribution of ensemble forecasts against the observed value:
$$\text {CRPS} = \int_{-\infty}^{\infty} (F(X) - H(X))^2 \, dX$$where
$F(X)$ is the cumulative distrbutiuon function (CDF) of the forecast;
$H(X)$ is the CDF of the observed outcome.
print(y_test.values.shape)
print(ensemble_predictions.shape)
(5193,) (10, 5193)
ensemble_predictions = ensemble_predictions.T
from properscoring import crps_ensemble
# Compute CRPS
crps = crps_ensemble(y_test.values, ensemble_predictions)
print(f'CRPS: {np.mean(crps)}')
CRPS: 0.09997477047952492
For the above, CRPS assesses how well the probabilistic distribution of the ensemble matches the observed value, indicating the accuracy and reliability of the forecast.
If the forecast perfectly predicts the observed outcome $F(X)$ will be identical to $H(X)$ for all values of $X$. In such as case the CRPS will be 0.
Otherwise the CRPS will be higher. Higher CRPS values conveys are larger discrepancy between the predicted and outcome distributions.
Visualization and Analysis¶
The concluding step is to visualize the distribution of the ensemble members to better comprehend their spread and variability:
plt.figure(figsize=(12, 6))
for i in range(n_members):
sns.kdeplot(ensemble_predictions[i], alpha=0.3, warn_singular = False)
plt.axvline(y_test.values.mean(), color='black', linestyle='--', label='Observed Mean')
plt.title('Density of Ensemble Members')
plt.legend()
plt.show()
By plotting the KDE (Kernel Density Estimate) of each ensemble member, we visualize how the forecast probabilities are distributed around the observed mean, which provides insights into the uncertainty and reliability of the forecast.
The Intersection of Data Science and Meteorology: A Powerful Partnership¶
The synergy between data science and meteorology has given rise to a new era of weather forecasting and climate analysis. By leveraging techniques such as data processing, statistical programming, data wrangling, exploratory data analysis, time series analysis, and machine learning, researchers and meteorologists are unlocking valuable insights from vast datasets.
Data processing and wrangling form the foundation of this partnership, ensuring that raw meteorological data is cleaned, standardized, and transformed into a usable format. Statistical programming languages like Python and R provide the tools to manipulate, analyze, and visualize this data effectively. Exploratory data analysis helps identify patterns, trends, and anomalies within the data, guiding further investigations.
Time series analysis is particularly crucial for meteorological data, as it often exhibits temporal dependencies. Time series algorithms like Prophet can capture these dependencies and make accurate predictions. Machine learning algorithms, such as local outlier factor, multilinear regression, quantile regression, logistic regression, random forests, and analysis offer powerful tools for modeling complex relationships between meteorological variables. Additionally, extreme value analysis and survival analysis also have meaningful application with meteorological data.
Conclusion¶
This project has demonstrated the potential of data wrangling, exploratory data analysis (EDA), statistical analysis, stochastic models and machine learning to extract valuable insights from historical meteorological data. By employing a range of such tools and techniques there was ability to visualize trends, uncover hidden relationships and characteristics within the data. Such development provided some foundation to explore temporal dependencies, forecast future trends, climate standing, extreme conditions, probability of outcomes, and develop weather prediction models.
The applied data from government agencies, along with Open-Meteo API and the Kaggle repository proved to be valuable resources for this project, offering a vast repository of high-quality historical weather data.
Overall, this project highlights the importance of leveraging advanced programming and analysis techniques to better understand climate data, patterns and improve our ability to predict future weather events. By applying all such knowledge and skills with robust datasets, one can gain valuable insights that can inform decision-making in various fields, such as agriculture, energy, climate preparedness, and disaster management.
References¶
Anderson, G.B., Bell, M.L. and Peng, R.D. (2013). Methods to Calculate the Heat Index as an Exposure Metric in Environmental Health Research. Environ Health Perspect 121:1111–1119; https://doi.org/10.1289/ehp.1206273
ECMWF. (2025, July 10). ERA5 Hourly Data on Single Levels From 1940 to Present. https://cds.climate.copernicus.eu/datasets/reanalysis-era5-single-levels?tab=download
Extreme Value Analysis. Met Office. (n.d.). https://www.metoffice.gov.uk/services/research-consulting/weather-climate-consultancy/extreme-value-analysis
Forecasting at Scale. Prophet. (n.d.). https://facebook.github.io/prophet/
Goel MK, Khanna P, Kishore J. Understanding Survival Analysis: Kaplan-Meier Estimate. Int J Ayurveda Res. 2010 Oct;1(4):274-8. doi: 10.4103/0974-7788.76794. PMID: 21455458; PMCID: PMC3059453.Hamdi, Y., Haigh, I. D., Parey, S., and Wahl, T.: Preface: Advances in Extreme Value analysis and Application to Natural Hazards, Nat. Hazards Earth Syst. Sci., 21, 1461–1465, https://doi.org/10.5194/nhess-21-1461-2021, 2021.
Hayes, A. (2019). How the Wilcoxon Test Is Used. Investopedia. https://www.investopedia.com/terms/w/wilcoxon-test.asp
Hayes, Adam. (2022). What Is a Times Series and How Is It Used to Analyze Data? Investopedia. https://www.investopedia.com/terms/t/timeseries.asp
Hersbach, H., Bell, B., Berrisford, P., Biavati, G., Horányi, A., Muñoz Sabater, J., Nicolas, J., Peubey, C., Radu, R., Rozum, I., Schepers, D., Simmons, A., Soci, C., Dee, D., Thépaut, J-N. (2023). ERA5 hourly data on single levels from 1940 to present [Data set]. ECMWF. https://doi.org/10.24381/cds.adbb2d47
Historical Hurricane Tracks. Climate Mapping for Resilience and Adaptation. (n.d.). https://resilience.climate.gov/datasets/fedmaps::historical-hurricane-tracks/about
Koenker, R., & José A. F. Machado. (1999). Goodness of Fit and Related Inference Processes for Quantile Regression. Journal of the American Statistical Association, 94(448), 1296–1310. https://doi.org/10.2307/2669943
Koenker, R. and Hallock, K. F. (2001). Quantile Regression. Journal of Economic Perspectives—Volume 15, Number 4—Pages 143–156
Localoutlierfactor. scikit. (n.d.). https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html
Li, X., Marcus, D. and Russell, J. et al. (2024). Weibull Parametric Model for Survival Analysis in Women with Endometrial Cancer using Clinical and T2-Weighted MRI Radiomic Features. BMC Med Res Methodol 24, 107 (2024). https://doi.org/10.1186/s12874-024-02234-1
MacFarland, T.W., Yates, J.M. (2016). Mann–Whitney U Test . In: Introduction to Nonparametric Statistics for the Biological Sciences Using R. Springer, Cham. https://doi.org/10.1007/978-3-319-30634-6_4
MacKinnon, J.G. 1994 “Approximate Asymptotic Distribution Functions for Unit-Root and Cointegration Tests.” Journal of Business & Economics Statistics, 12.2, 167-76.
MacKinnon, J.G. 2010. “Critical Values for Cointegration Tests.” Queen”s University, Dept of Economics Working Papers 1227. http://ideas.repec.org/p/qed/wpaper/1227.html
Muñoz Sabater, J. (2019). ERA5-Land hourly data from 2001 to present [Data set]. ECMWF. https://doi.org/10.24381/CDS.E2161BAC
Muschinski , T. et al (2023). Robust Weather-Adaptive Post-Processing using Model Output Statistics Random Forests. Nonlinear Processes in Geophysics, 30, 503–514. https://doi.org/10.5194/npg-30-503-2023
NCEI. (n.d.). Storm Events Database. National Centers for Environmental Information. https://www.ncdc.noaa.gov/stormevents/listevents.jsp?eventType=ALL&beginDate_mm=09&beginDate_dd=01&beginDate_yyyy=2023&endDate_mm=12&endDate_dd=31&endDate_yyyy=2023&county=ALL&hailfilter=0.00&tornfilter=0&windfilter=000&sort=DT&submitbutton=Search&statefips=36%2CNEW%2BYORK
NEON (National Ecological Observatory Network). Shortwave and Longwave Radiation (Net Radiometer) (DP1.00023.001), Provisional Data. Dataset accessed from https://data.neonscience.org/data-products/DP1.00023.001 on November 24, 2024
NOAA. (2017, January 20). Hurricanes and Typhoons, 1851-2014. Kaggle. https://www.kaggle.com/datasets/noaa/hurricane-database/data
NOAA Predicts Above-Normal 2024 Atlantic Hurricane Season. National Oceanic and Atmospheric Administration. (n.d.). https://www.noaa.gov/news-release/noaa-predicts-above-normal-2024-atlantic-hurricane-season
NWS (National Weather Service). 2011. Meteorological Conversions and Calculations: Heat Index Calculator. Available: https://www.wpc.ncep.noaa.gov/html/heatindex.shtml [accessed 02 October 2024]
Open-Meteo. (2022). Historical Weather API. Historical Weather API. https://open-meteo.com/en/docs/historical-weather-api
Perktold, J., & Seabold, S. (n.d.). Stationarity and Detrending (ADF/KPSS) - statsmodels 0.14.1. https://www.statsmodels.org/stable/examples/notebooks/generated/stationarity_detrending_adf_kpss.html
Perktold, J., & Seabold, S. (n.d.). Statsmodels.tsa.stattools.coint - statsmodels 0.15.0 (+270). https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.coint.htmlerg
Saffir-Simpson Hurricane Wind Scale. (n.d.). https://www.nhc.noaa.gov/aboutsshws.php
Sarmento, D.(n.d.). Chapter 22: Correlation Types and When to Use Them. https://ademos.people.uic.edu/Chapter22.html
Schimanke S., Ridal M., Le Moigne P., Berggren L., Undén P., Randriamampianina R., Andrea U., Bazile E., Bertelsen A., Brousseau P., Dahlgren P., Edvinsson L., El Said A., Glinton M., Hopsch S., Isaksson L., Mladek R., Olsson E., Verrelle A., Wang Z.Q. (2021). CERRA sub-daily regional reanalysis data for Europe on single levels from 1984 to present [Data set]. ECMWF. https://doi.org/10.24381/CDS.622A565A
Stalpers, L. J. A., & Kaplan, E. L. (2018). Edward L. Kaplan and the Kaplan-Meier Survival Curve. BSHM Bulletin: Journal of the British Society for the History of Mathematics, 33(2), 109–135.https://doi.org/10.1080/17498430.2018.1450055
Stigler, M. (2020). Chapter 7 - Nonlinear Time sSeries in R: Threshold cointegration with tsDyn. In: Handbook of Statistics. Elsevier, Volume 42, 2020, Pages 229-264. https://doi.org/10.1016/bs.host.2019.01.008
The Comprehensive R Archive Network. (n.d.). https://cran.r-project.org/web/packages/prophet/vignettes/quick_start.html
Warner, T. T. (2010). Ensemble Methods. In: Numerical Weather and Climate Prediction (pp. 252–283). Chapter 7, Cambridge: Cambridge University Press.
WMO Meteorological Codes. WMO meteorological codes. (n.d.). https://artefacts.ceda.ac.uk/badc_datadocs/surface/code.html
Zippenfenig, P. (2023). Open-Meteo.com Weather API [Computer software]. Zenodo. https://doi.org/10.5281/ZENODO.7970649
5.2 Smoothing Time Series: Stat 510. PennState: Statistics Online Courses. (n.d.). https://online.stat.psu.edu/stat510/lesson/5/5.2